Delivery Standardization - MSBA Capstone 2025

Author

Kleyton R. Polzonoff

Published

April 18, 2025

Important Note: To display the code, click the “Code” button in the body of the document, or click the </> Code button at the top right and select “Show All Code.”


1. Business Problem Statement and Objectives

The client, a major beverage supplier, needs a structured system to optimize logistics between its own fleet of Red Trucks and alternative delivery methods (ARTM), which include partner trucks and third-party carriers known as White Trucks. Red Trucks enhance customer relationships and contribute to revenue, while ARTM offers flexibility but limits interaction and control.

To ensure high-quality service and cost efficiency, I will establish clear fleet allocation guidelines based on customer profiles, transaction data, addresses, and delivery costs. This approach will determine the optimal truck type for each customer using a well-defined annual volume threshold. Additionally, customer segmentation will identify shared characteristics, enabling more strategic and data-driven decision-making.

Based on these insights, I will provide actionable recommendations to optimize fleet allocation and enhance operational efficiency.

2. Analytical Approach and Deliverables

The analysis will be conducted separately for two customer groups:

  • All Customers – The broader customer base, including those who purchase various product types.
  • Local Market Partners Buying Fountain Only – Customers who purchase only fountain drinks, excluding CO2, cans, or bottles.

To address logistics challenges and transform decisions into data-driven solutions, the approach will combine predictive models with clustering techniques, using both supervised and unsupervised learning methods to build a structured and efficient logistics framework.

Supervised Learning

Supervised learning techniques will be applied to determine whether each customer should be served by Red Trucks (our own fleet) or White Trucks (ARTM), based on a defined set of criteria.

Refining the Fleet Allocation Strategy: Initial Assumptions

As the dataset does not provide predefined fleet allocation criteria, I will establish initial reference points to guide this analysis:

  • Annual Volume Threshold: Customers receiving at least 400 cases and/or gallons per year will initially be assigned to the Red Truck fleet, while those below this threshold will be served by White Trucks.
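As a minimal sketch of this rule (the `TOTAL_ANNUAL_VOLUME` column name and the customer values are illustrative, not from the client data):

```r
# Hypothetical example: assign fleet by the 400-unit annual volume threshold.
# TOTAL_ANNUAL_VOLUME is an assumed column combining cases and gallons per year.
customers <- data.frame(
  CUSTOMER_NUMBER = c(101, 102, 103),
  TOTAL_ANNUAL_VOLUME = c(520, 399, 400)
)

# Customers at or above 400 cases/gallons per year go to Red Trucks;
# everyone else is served by White Trucks (ARTM).
customers$FLEET <- ifelse(customers$TOTAL_ANNUAL_VOLUME >= 400,
                          "RED_TRUCK", "WHITE_TRUCK")
```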

Unsupervised Learning - Customer Segmentation

A clustering analysis will identify customer groups with similar consumption patterns, refining fleet allocation and enhancing decision-making rules.
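As a minimal sketch of this clustering step, using base R's kmeans on two illustrative consumption features (the feature names and values are placeholders, and the final analysis may use a different algorithm or feature set):

```r
set.seed(42)  # reproducible cluster assignment

# Toy consumption features per customer (placeholder names and values).
features <- data.frame(
  annual_cases   = c(10, 12, 500, 520, 1000),
  annual_gallons = c(5, 8, 200, 210, 400)
)

# Standardize so both features contribute comparably, then cluster.
scaled <- scale(features)
km <- kmeans(scaled, centers = 2, nstart = 25)

# km$cluster now holds one cluster label per customer.
```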

Cost Impact Analysis

Given that the delivery cost of White Trucks is one-fifth that of Red Trucks, different logistics scenarios will be analyzed to compare the costs and strategic impacts of the current approach against the final recommended strategy.
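A simple illustration of this cost ratio (the per-delivery figures and scenario splits below are placeholders, not client data):

```r
# Placeholder unit costs: a White Truck delivery costs one fifth of a Red Truck one.
red_cost_per_delivery   <- 10  # assumed illustrative value
white_cost_per_delivery <- red_cost_per_delivery / 5

# Compare two scenarios for 1,000 deliveries split differently across fleets.
scenario_cost <- function(n_red, n_white) {
  n_red * red_cost_per_delivery + n_white * white_cost_per_delivery
}

current_cost  <- scenario_cost(n_red = 800, n_white = 200)  # mostly Red Trucks
proposed_cost <- scenario_cost(n_red = 400, n_white = 600)  # shift low-volume stops
savings <- current_cost - proposed_cost
```

Reallocating low-volume stops to White Trucks in this toy scenario cuts delivery cost substantially, which is the trade-off the scenario analysis will quantify with real data.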

Recommendations

Based on the analysis, data-driven recommendations will be provided to optimize fleet allocation, with a focus on improving service quality, cost efficiency, and strategic decision-making. This approach ensures that each customer receives the most suitable delivery method.

Description of the Data

This project will use four data files provided by the company:

  1. customer_profile.csv
  2. transactional_data.csv
  3. customer_address_and_zip_mapping.csv
  4. delivery_cost_data.xlsx
  • The customer profile data includes information on all customers they deliver to. This file contains each customer’s unique ID along with various categorical variables that describe their location, industry, and delivery preferences.

  • The transactional data contains all transactions from all customers, with ordered and delivered product amounts measured in cases and gallons.

  • The customer address file only contains two columns – zip code and full address. This can be used in tandem with the customer profile data.

  • The delivery cost data maps the cost of delivering a product based on different criteria. This will be used with the transaction data to find the cost of each transaction.

3. Exploratory Data Analysis (EDA) - Part I

This section analyzes the provided data to identify solutions, with a focus on completeness, consistency, and potential issues. Data transformations may include the creation of new variables to improve model accuracy. Given the large number of variables, the most relevant ones will be prioritized to ensure clarity, while less relevant analyses will be excluded to avoid information overload.

3.1 Loading and Cleaning Datasets

Code
# Load the profile CSV
profile_data <- read.csv(file = "customer_profile.csv", 
                         sep = ",", 
                         stringsAsFactors = FALSE)

# Load the address CSV
customer_address <- read.csv(file = "customer_address_and_zip_mapping.csv", 
                             sep = ",", 
                             stringsAsFactors = FALSE)

# Load the transactional CSV
op_data <- read.csv(file = "transactional_data.csv", 
                    sep = ",", 
                    stringsAsFactors = FALSE)

Missing data assessments and any substitutions or modifications will be carried out and will be included in the provided R Markdown file. However, some of these actions will not be displayed in this report to avoid content overload.

Profile Dataset - Cleaning and Adjustments

  • The number of unique CUSTOMER_NUMBER values in profile_data is greater than in the transactional data. This will be addressed later, before merging the datasets.
  • There are no duplicates or missing values for CUSTOMER_NUMBER.
  • Date variables were adjusted to the proper format.
  • Logical variables were converted to integers, where 0 represents false and 1 represents true.
  • Special characters and extra spaces were removed in factor variables.
  • Missing values in the PRIMARY_GROUP_NUMBER field were replaced with zero.
  • The CHAIN_MEMBER variable was created to indicate whether the outlet belongs to a chain (has a PRIMARY_GROUP_NUMBER). A value of 1 represents a member, and 0 represents a non-member.

Customer Address Dataset - Cleaning and Adjustments

  • The address was split into new columns for each component.

  • The dataset does not contain customers’ actual addresses but will be used for data aggregation to support customer segmentation. It includes 145 rows with identical geographic coordinates; however, no ZIP codes are duplicated.
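A sketch of the address-splitting step, assuming the full address is a single comma-delimited string (the component order and sample values below are illustrative; the real delimiter and field order should be confirmed against the file):

```r
# Hypothetical format: components separated by commas.
addr <- data.frame(
  zip = "12345",
  full_address = "12345,Springfield,IL,Sangamon,MIDWEST,39.78,-89.65",
  stringsAsFactors = FALSE
)

# Split each address into its components and spread them into new columns.
parts <- strsplit(addr$full_address, ",", fixed = TRUE)
addr$CITY      <- sapply(parts, `[`, 2)
addr$STATE     <- sapply(parts, `[`, 3)
addr$COUNTY    <- sapply(parts, `[`, 4)
addr$REGION    <- sapply(parts, `[`, 5)
addr$LATITUDE  <- as.numeric(sapply(parts, `[`, 6))
addr$LONGITUDE <- as.numeric(sapply(parts, `[`, 7))
```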

Transactional Dataset - Cleaning and Adjustments

  • 11,131 null values in the ORDER_TYPE column were replaced with “OTHER.”

  • The DAYS_AFTER column was added to track the number of days since the transaction, up to February 2, 2025.

  • 483 rows with zero values in ORDERED, LOADED, and DELIVERED CASES and GALLONS will be removed from the dataset.

  • Negative values in DELIVERED_CASES and DELIVERED_GALLONS have been moved to new columns (RETURNED_CASES and RETURNED_GALLONS), and the original columns were set to zero.

  • 30,965 transactions are related to order and/or load but do not have delivery or return data. These will be classified as “order_load” in the DLV_TYPE column.
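The negative-value handling described above can be sketched in base R (column names follow the report; the toy values are illustrative):

```r
# Toy transactions where a negative delivered quantity represents a return.
tx <- data.frame(DELIVERED_CASES   = c(10, -3,  0),
                 DELIVERED_GALLONS = c(0,   5, -2))

# Move negative delivered amounts into RETURNED_* columns (as positive values),
# then zero out the negatives in the original columns.
tx$RETURNED_CASES    <- ifelse(tx$DELIVERED_CASES   < 0, -tx$DELIVERED_CASES,   0)
tx$RETURNED_GALLONS  <- ifelse(tx$DELIVERED_GALLONS < 0, -tx$DELIVERED_GALLONS, 0)
tx$DELIVERED_CASES   <- pmax(tx$DELIVERED_CASES,   0)
tx$DELIVERED_GALLONS <- pmax(tx$DELIVERED_GALLONS, 0)
```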

3.2 Combined Dataset Driven by Transactions

During the exploration, combining all available data was identified as the most effective approach for subsequent analyses. Two files were created: one preserving individual transactions and another compiling information by customer. Both will be used in the exploratory data analysis.

The profile data contains 1,801 unique ZIP codes, all of which matched ZIP codes in the customer address dataset. It is important to note that some ZIP codes share the same geographic coordinates, reducing reliability in those cases.

As previously mentioned, the number of unique customer numbers in the profile data (now referred to as full data) is greater than in the transactions dataset. Only customers present in the transactions dataset were included in the merged data.

Code
# Merge customer_address with profile_data using ZIP_CODE
full_data <- profile_data %>%
  left_join(customer_address, by = c("ZIP_CODE" = "ZIP"))

# Check the number of unique CUSTOMER_NUMBER in full_data and op_data
print(length(unique(full_data$CUSTOMER_NUMBER)))
print(length(unique(op_data$CUSTOMER_NUMBER)))

# Filter full_data to keep only CUSTOMER_NUMBERs that are also in op_data, and merge with op_data
full_data <- full_data %>%
  filter(CUSTOMER_NUMBER %in% op_data$CUSTOMER_NUMBER) %>%
  left_join(op_data, by = "CUSTOMER_NUMBER")

Below are the first 5 rows and 6 columns of the combined dataset.

Code
# Display the first few rows of the combined dataset
head(full_data[, 1:6], 5)
  CUSTOMER_NUMBER PRIMARY_GROUP_NUMBER FREQUENT_ORDER_TYPE FIRST_DELIVERY_DATE
1       501556470                  376       MYCOKE LEGACY          2024-01-02
2       501556470                  376       MYCOKE LEGACY          2024-01-02
3       501556470                  376       MYCOKE LEGACY          2024-01-02
4       501556470                  376       MYCOKE LEGACY          2024-01-02
5       501556470                  376       MYCOKE LEGACY          2024-01-02
  ON_BOARDING_DATE COLD_DRINK_CHANNEL
1       2023-08-28             DINING
2       2023-08-28             DINING
3       2023-08-28             DINING
4       2023-08-28             DINING
5       2023-08-28             DINING

The variable LOCAL_FOUNT_ONLY will be created to identify whether the transaction’s customer belongs to the “Local Market Partners Buying Fountain Only” group—customers who purchase only fountain drinks, excluding CO2, cans, or bottles. It will be assigned a value of 1 if the customer belongs to this group and 0 otherwise.

Code
# Aggregate total delivered cases and gallons per customer
customer_summary <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(
    TOTAL_DELIVERED_CASES = sum(DELIVERED_CASES),
    TOTAL_DELIVERED_GALLONS = sum(DELIVERED_GALLONS),
    LOCAL_MARKET_PARTNER = max(LOCAL_MARKET_PARTNER),
    CO2_CUSTOMER = max(CO2_CUSTOMER),
    .groups = "drop")

# Classify customers based on aggregated values
customer_summary <- customer_summary %>%
  mutate(LOCAL_FOUNT_ONLY = case_when(
    LOCAL_MARKET_PARTNER == 1 & CO2_CUSTOMER == 0 & 
      TOTAL_DELIVERED_GALLONS > 0 & TOTAL_DELIVERED_CASES == 0 ~ 1L,
    TRUE ~ 0L))

# Merge back to original data
full_data <- full_data %>%
  left_join(dplyr::select(customer_summary, CUSTOMER_NUMBER, LOCAL_FOUNT_ONLY), by = "CUSTOMER_NUMBER")

# Remove temporary variables and data frames
rm(customer_summary)

The code below will create a table for an initial overview of the customer types.

Code
# Aggregate data by LOCAL_FOUNT_ONLY
summary_data <- full_data %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    customers = n_distinct(CUSTOMER_NUMBER),  # Count unique customers
    transactions = n(),  # Count transactions for this group
    qtd_cas = sum(DELIVERED_CASES),  # Total delivered cases
    qtd_gal = sum(DELIVERED_GALLONS),  # Total delivered gallons
    total_qtd = sum(DELIVERED_CASES) + sum(DELIVERED_GALLONS),  # Total volume (cases + gallons)
    .groups = "drop"
  ) %>%
  mutate(
    pct_cust = customers / sum(customers) * 100,  
    pct_trans = transactions / nrow(full_data) * 100,  
    pct_qtd = total_qtd / sum(total_qtd) * 100,  
    pct_gal = qtd_gal / sum(qtd_gal) * 100  
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY) %>%  # Rename the column
  mutate(
    # Formatting numbers with comma separator, rounding before formatting with commas
    customers = format(customers, big.mark = ",", scientific = FALSE),
    transactions = format(transactions, big.mark = ",", scientific = FALSE),
    qtd_cas = format(round(qtd_cas, 0), big.mark = ",", scientific = FALSE),  
    qtd_gal = format(round(qtd_gal, 0), big.mark = ",", scientific = FALSE),  
    total_qtd = format(round(total_qtd, 0), big.mark = ",", scientific = FALSE),  
    pct_cust = round(pct_cust, 1),
    pct_trans = round(pct_trans, 1),
    pct_qtd = round(pct_qtd, 1),
    pct_gal = round(pct_gal, 1)
  )

# Add the total row (LFO = "Total")
total_row <- summary_data %>%
  summarise(
    LFO = "Total",
    customers = sum(as.numeric(gsub(",", "", customers))),
    transactions = sum(as.numeric(gsub(",", "", transactions))),
    qtd_cas = sum(as.numeric(gsub(",", "", qtd_cas))),
    qtd_gal = sum(as.numeric(gsub(",", "", qtd_gal))),
    total_qtd = sum(as.numeric(gsub(",", "", total_qtd))),
    pct_cust = 100,
    pct_trans = 100,
    pct_qtd = 100,
    pct_gal = 100
  ) %>%
  mutate(
    customers = format(customers, big.mark = ",", scientific = FALSE),
    transactions = format(transactions, big.mark = ",", scientific = FALSE),
    qtd_cas = format(round(qtd_cas, 0), big.mark = ",", scientific = FALSE),
    qtd_gal = format(round(qtd_gal, 0), big.mark = ",", scientific = FALSE),
    total_qtd = format(round(total_qtd, 0), big.mark = ",", scientific = FALSE),
    pct_cust = round(pct_cust, 1),
    pct_trans = round(pct_trans, 1),
    pct_qtd = round(pct_qtd, 1),
    pct_gal = round(pct_gal, 1)
  )

# Convert LFO to character to ensure consistency
summary_data <- summary_data %>% mutate(LFO = as.character(LFO))
total_row <- total_row %>% mutate(LFO = as.character(LFO))

# Combine the summary data with the total row
combined_data <- bind_rows(summary_data, total_row)

# Reorder the columns
combined_data <- combined_data[, c("LFO", "customers", "pct_cust", "transactions", "pct_trans", "qtd_cas", "qtd_gal", "pct_gal", "total_qtd", "pct_qtd")]

# Create the combined table
combined_data %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:10, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%  # Light gray header
  add_header_above(c("Local Market Partners Fountain Only (LFO) - Delivery Quantities Overview" = 10)) %>%
  kable_paper("striped", full_width = F)
Local Market Partners Fountain Only (LFO) - Delivery Quantities Overview
LFO customers pct_cust transactions pct_trans qtd_cas qtd_gal pct_gal total_qtd pct_qtd
0 28,961 95.5 1,013,652 97 26,434,079 9,086,878 94.1 35,520,957 98.4
1 1,359 4.5 31,405 3 0 573,314 5.9 573,314 1.6
Total 30,320 100.0 1,045,057 100 26,434,079 9,660,192 100.0 36,094,271 100.0
Code
# Remove temporary variables and data frames
rm(summary_data, total_row, combined_data)

Only 4.5% of customers are Local Market Partners who buy only fountain drinks and do not purchase CO2 (LFO = 1), accounting for 3% of transactions. They consumed 5.9% of delivered gallons but represent just 1.6% of the total volume (cases + gallons).

This small group of 1,359 customers includes 83 transactions with positive ordered cases. The last such order was placed on December 19, 2024, which left enough time for any resulting case deliveries to appear in the transactions. Since none did, these customers will remain classified in the LFO group: they consume fountain drinks (gallons) despite occasionally ordering cases.
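This check can be expressed as a small base-R sketch (the toy data below stands in for the LFO subset of full_data; column names follow the report):

```r
# Toy LFO transactions: some orders include cases, but none are delivered.
lfo_tx <- data.frame(ORDERED_CASES   = c(2, 0, 1),
                     DELIVERED_CASES = c(0, 0, 0))

# Count orders with positive cases, and confirm no cases were actually delivered.
n_case_orders      <- sum(lfo_tx$ORDERED_CASES > 0)
no_case_deliveries <- all(lfo_tx$DELIVERED_CASES == 0)
```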

3.3 Combined Dataset Driven by Outlets

The information from the combined transaction dataset (full_data) will now be merged by customer and named full_data_customer. The goal is to create a unique list of customers who have made transactions. This file will contain a large number of columns and will be used for further analysis.

Code
# Creating the YEAR_MONTH column to identify the periods
full_data <- full_data %>%
  mutate(YEAR_MONTH = format(as.Date(TRANSACTION_DATE), "%Y_%m"))

# Function to count transactions by period
count_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_count = sum(!!sym(value_column) > 0, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_count, names_prefix = prefix, values_fill = list(value_count = 0))
}

# Counting transactions for each metric
trans_ordered_cases <- count_transactions(full_data, "ORDERED_CASES", "TRANS_ORD_CA_")
trans_ordered_gallons <- count_transactions(full_data, "ORDERED_GALLONS", "TRANS_ORD_GAL_")
trans_delivered_cases <- count_transactions(full_data, "DELIVERED_CASES", "TRANS_DLV_CA_")
trans_delivered_gallons <- count_transactions(full_data, "DELIVERED_GALLONS", "TRANS_DLV_GAL_")
trans_returned_cases <- count_transactions(full_data, "RETURNED_CASES", "TRANS_RET_CA_")
trans_returned_gallons <- count_transactions(full_data, "RETURNED_GALLONS", "TRANS_RET_GAL_")

# Function to sum the values by period
sum_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_sum = sum(!!sym(value_column), na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_sum, names_prefix = prefix, values_fill = list(value_sum = 0))
}

# Summing transactions for each metric
qtd_ordered_cases <- sum_transactions(full_data, "ORDERED_CASES", "QTD_ORD_CA_")
qtd_ordered_gallons <- sum_transactions(full_data, "ORDERED_GALLONS", "QTD_ORD_GAL_")
qtd_delivered_cases <- sum_transactions(full_data, "DELIVERED_CASES", "QTD_DLV_CA_")
qtd_delivered_gallons <- sum_transactions(full_data, "DELIVERED_GALLONS", "QTD_DLV_GAL_")
qtd_returned_cases <- sum_transactions(full_data, "RETURNED_CASES", "QTD_RET_CA_")
qtd_returned_gallons <- sum_transactions(full_data, "RETURNED_GALLONS", "QTD_RET_GAL_")

# Ensure the columns in column_order are present in full_data
column_order <- c("CUSTOMER_NUMBER", "PRIMARY_GROUP_NUMBER", "FREQUENT_ORDER_TYPE", 
                  "FIRST_DELIVERY_DATE", "ON_BOARDING_DATE", "LOCAL_FOUNT_ONLY","COLD_DRINK_CHANNEL", 
                  "TRADE_CHANNEL", "SUB_TRADE_CHANNEL", "LOCAL_MARKET_PARTNER", 
                  "CO2_CUSTOMER", "ZIP_CODE", "CHAIN_MEMBER", "CITY", "STATE", 
                  "COUNTY", "REGION", "LATITUDE", "LONGITUDE")

# Check if all columns exist in full_data
missing_cols <- setdiff(column_order, colnames(full_data))
if (length(missing_cols) > 0) {
  stop("The following columns are missing in full_data: ", paste(missing_cols, collapse = ", "))
}

# Count the number of transactions per customer
trans_count <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TRANSACTIONS_DATE_COUNT = n(), .groups = "drop")

# Joining the data with the required columns in the desired order
full_data_customer <- distinct(full_data[, column_order]) %>%
  left_join(trans_count, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_gallons, by = "CUSTOMER_NUMBER")

# Rename Order Types
full_data <- full_data %>%
  mutate(ORDER_TYPE = dplyr::recode(ORDER_TYPE, 
                             "CALL CENTER" = "CALL.CENTER",
                             "MYCOKE LEGACY" = "MYCOKE.LEGACY",
                             "SALES REP" = "SALES.REP"))

# Count transactions by ORDER_TYPE
order_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, ORDER_TYPE) %>%
  summarise(order_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = ORDER_TYPE, values_from = order_type_count, names_prefix = "OT_", values_fill = list(order_type_count = 0))

# Count transactions by DLV_TYPE
dlv_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, DLV_TYPE) %>%
  summarise(dlv_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = DLV_TYPE, values_from = dlv_type_count, names_prefix = "DLVT_", values_fill = list(dlv_type_count = 0))

# Join with the full_data_customer to ensure ORDER_TYPE and DLV_TYPE columns are added
full_data_customer <- full_data_customer %>%
  left_join(order_type_count, by = "CUSTOMER_NUMBER") %>%
  left_join(dlv_type_count, by = "CUSTOMER_NUMBER")

# Adding the requested summary columns
full_data_customer <- full_data_customer %>%
  mutate(TOTAL_CASES_ORDERED = rowSums(full_data_customer[, grep("^QTD_ORD_CA_", names(full_data_customer))]),
         TOTAL_CASES_DELIVERED = rowSums(full_data_customer[, grep("^QTD_DLV_CA_", names(full_data_customer))]),
         TOTAL_GALLONS_ORDERED = rowSums(full_data_customer[, grep("^QTD_ORD_GAL_", names(full_data_customer))]),
         TOTAL_GALLONS_DELIVERED = rowSums(full_data_customer[, grep("^QTD_DLV_GAL_", names(full_data_customer))]),
         TOTAL_CASES_RETURNED = rowSums(full_data_customer[, grep("^QTD_RET_CA_", names(full_data_customer))]),
         TOTAL_GALLONS_RETURNED = rowSums(full_data_customer[, grep("^QTD_RET_GAL_", names(full_data_customer))]))

# Ensuring column order
ot_columns <- colnames(order_type_count)[-1]
dlvt_columns <- colnames(dlv_type_count)[-1]
summary_columns <- c("TOTAL_CASES_ORDERED", "TOTAL_CASES_DELIVERED", "TOTAL_GALLONS_ORDERED", "TOTAL_GALLONS_DELIVERED", "TOTAL_CASES_RETURNED", "TOTAL_GALLONS_RETURNED")
transaction_columns <- grep("^TRANS_", colnames(full_data_customer), value = TRUE)
quantity_columns <- grep("^QTD_", colnames(full_data_customer), value = TRUE)
ordered_columns <- c(column_order, "TRANSACTIONS_DATE_COUNT", ot_columns, dlvt_columns, summary_columns, sort(transaction_columns), sort(quantity_columns))

# Reordering full_data_customer
full_data_customer <- full_data_customer[, ordered_columns]

# Replacing NAs with 0 in transaction and quantity columns
full_data_customer[is.na(full_data_customer)] <- 0

# Extra variables

# Define reference date
ref_date <- as.Date("2025-02-01")

# 1. DAYS_FIRST_DLV
full_data_customer$DAYS_FIRST_DLV <- as.numeric(difftime(ref_date, full_data_customer$FIRST_DELIVERY_DATE, units = "days"))

# 2. DAYS_ONBOARDING
full_data_customer$DAYS_ONBOARDING <- as.numeric(difftime(ref_date, full_data_customer$ON_BOARDING_DATE, units = "days"))

# 3. Average transactions per month
# Replace NA with 0 for missing transactions
full_data_customer[is.na(full_data_customer)] <- 0

# Calculate the average transaction per month
cols_to_average_dlv <- grep("^TRANS_DLV_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_dlv] <- lapply(full_data_customer[cols_to_average_dlv], as.numeric)
full_data_customer$AVG_TRANS_DLV_CA_M <- rowMeans(full_data_customer[, cols_to_average_dlv], na.rm = TRUE)

cols_to_average_gal <- grep("^TRANS_DLV_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_gal] <- lapply(full_data_customer[cols_to_average_gal], as.numeric)
full_data_customer$AVG_TRANS_DLV_GAL_M <- rowMeans(full_data_customer[, cols_to_average_gal], na.rm = TRUE)

cols_to_average_ord_ca <- grep("^TRANS_ORD_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ord_ca] <- lapply(full_data_customer[cols_to_average_ord_ca], as.numeric)
full_data_customer$AVG_TRANS_ORD_CA_M <- rowMeans(full_data_customer[, cols_to_average_ord_ca], na.rm = TRUE)

cols_to_average_ord_gal <- grep("^TRANS_ORD_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ord_gal] <- lapply(full_data_customer[cols_to_average_ord_gal], as.numeric)
full_data_customer$AVG_TRANS_ORD_GAL_M <- rowMeans(full_data_customer[, cols_to_average_ord_gal], na.rm = TRUE)

cols_to_average_ret_ca <- grep("^TRANS_RET_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ret_ca] <- lapply(full_data_customer[cols_to_average_ret_ca], as.numeric)
full_data_customer$AVG_TRANS_RET_CA_M <- rowMeans(full_data_customer[, cols_to_average_ret_ca], na.rm = TRUE)

cols_to_average_ret_gal <- grep("^TRANS_RET_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ret_gal] <- lapply(full_data_customer[cols_to_average_ret_gal], as.numeric)
full_data_customer$AVG_TRANS_RET_GAL_M <- rowMeans(full_data_customer[, cols_to_average_ret_gal], na.rm = TRUE)

# 4. Number of transactions per year (sum annual columns)
full_data_customer$NUM_TRANS_ORD_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_ORD_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_ORD_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_ORD_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_ORD_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_DLV_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_DLV_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_DLV_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_DLV_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_RET_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_RET_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_RET_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_RET_GAL_2024", names(full_data_customer))], na.rm = TRUE)

# 5. Sum of quantities per year
full_data_customer$QTD_ORD_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_ORD_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_ORD_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_ORD_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_ORD_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_DLV_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_DLV_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_DLV_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_DLV_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_RET_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_RET_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_RET_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_RET_GAL_2024", names(full_data_customer))], na.rm = TRUE)

# 6. Create new columns for CUST_23 and CUST_24
full_data_customer$ACTIVE_23 <- ifelse((full_data_customer$QTD_DLV_CA_2023 + full_data_customer$QTD_DLV_GAL_2023) > 0, 1, 0)
full_data_customer$ACTIVE_24 <- ifelse((full_data_customer$QTD_DLV_CA_2024 + full_data_customer$QTD_DLV_GAL_2024) > 0, 1, 0)

# Display the first few rows of the combined dataset
#head(full_data_customer)

3.4 Estimated Delivery Costs

Delivery costs are estimates: the company provided the median delivery cost within volume ranges, broken down by COLD_DRINK_CHANNEL.

Code
# Load the delivery cost data from the Excel file
cost_data <- read_excel("delivery_cost_data.xlsx")

# Convert 'Cold Drink Channel' to a factor
cost_data <- cost_data %>%
  mutate(COLD_DRINK_CHANNEL = factor(`Cold Drink Channel`))


# Manually recode 'Applicable To' values
cost_data <- cost_data %>%
  mutate(`Applicable To` = ifelse(`Applicable To` == "Bottles and Cans", "CASES", 
                                  ifelse(`Applicable To` == "Fountain", "GALLONS", `Applicable To`)))

# Create RANGE_LEVEL based on 'Vol Range' and make it a factor
cost_data$RANGE_LEVEL <- factor(case_when(
  cost_data$`Vol Range` == "0 - 149" & cost_data$`Applicable To` == "CASES" ~ "RANGE_1_CASES",
  cost_data$`Vol Range` == "150 - 299" & cost_data$`Applicable To` == "CASES" ~ "RANGE_2_CASES", 
  cost_data$`Vol Range` == "300 - 449" & cost_data$`Applicable To` == "CASES" ~ "RANGE_3_CASES",
  cost_data$`Vol Range` == "450 - 599" & cost_data$`Applicable To` == "CASES" ~ "RANGE_4_CASES",
  cost_data$`Vol Range` == "600 - 749" & cost_data$`Applicable To` == "CASES" ~ "RANGE_5_CASES",
  cost_data$`Vol Range` == "750 - 899" & cost_data$`Applicable To` == "CASES" ~ "RANGE_6_CASES",
  cost_data$`Vol Range` == "900 - 1049" & cost_data$`Applicable To` == "CASES" ~ "RANGE_7_CASES",
  cost_data$`Vol Range` == "1050 - 1199" & cost_data$`Applicable To` == "CASES" ~ "RANGE_8_CASES",
  cost_data$`Vol Range` == "1200 - 1349" & cost_data$`Applicable To` == "CASES" ~ "RANGE_9_CASES",
  cost_data$`Vol Range` == "1350+" & cost_data$`Applicable To` == "CASES" ~ "RANGE_10_CASES",
  
  cost_data$`Vol Range` == "0 - 149" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_1_GALLONS",
  cost_data$`Vol Range` == "150 - 299" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_2_GALLONS", 
  cost_data$`Vol Range` == "300 - 449" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_3_GALLONS",
  cost_data$`Vol Range` == "450 - 599" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_4_GALLONS",
  cost_data$`Vol Range` == "600 - 749" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_5_GALLONS",
  cost_data$`Vol Range` == "750 - 899" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_6_GALLONS",
  cost_data$`Vol Range` == "900 - 1049" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_7_GALLONS",
  cost_data$`Vol Range` == "1050 - 1199" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_8_GALLONS",
  cost_data$`Vol Range` == "1200 - 1349" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_9_GALLONS",
  cost_data$`Vol Range` == "1350+" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_10_GALLONS"
))

# Reorder columns to keep only the desired ones: COLD_DRINK_CHANNEL, VOL_RANGE, RANGE_LEVEL, MEDIAN DELIVERY COST
cost_data <- cost_data[, c("COLD_DRINK_CHANNEL", "Vol Range", "RANGE_LEVEL", "Median Delivery Cost")]

# Check the result
#head(cost_data)

The necessary variables will be created to calculate the delivery costs for cases and gallons for the years 2023 and 2024 by customer.

Code
# Create cost range columns for each year based on quantities
full_data_customer <- full_data_customer %>%
  mutate(
    # For 2023, categorize based on quantity ranges for cases
    COST_RANGE_CA_23 = case_when(
      QTD_DLV_CA_2023 >= 0 & QTD_DLV_CA_2023 < 150 ~ "RANGE_1_CASES", 
      QTD_DLV_CA_2023 >= 150 & QTD_DLV_CA_2023 < 300 ~ "RANGE_2_CASES",
      QTD_DLV_CA_2023 >= 300 & QTD_DLV_CA_2023 < 450 ~ "RANGE_3_CASES",
      QTD_DLV_CA_2023 >= 450 & QTD_DLV_CA_2023 < 600 ~ "RANGE_4_CASES",
      QTD_DLV_CA_2023 >= 600 & QTD_DLV_CA_2023 < 750 ~ "RANGE_5_CASES",
      QTD_DLV_CA_2023 >= 750 & QTD_DLV_CA_2023 < 900 ~ "RANGE_6_CASES",
      QTD_DLV_CA_2023 >= 900 & QTD_DLV_CA_2023 < 1050 ~ "RANGE_7_CASES",
      QTD_DLV_CA_2023 >= 1050 & QTD_DLV_CA_2023 < 1200 ~ "RANGE_8_CASES",
      QTD_DLV_CA_2023 >= 1200 & QTD_DLV_CA_2023 < 1350 ~ "RANGE_9_CASES",
      QTD_DLV_CA_2023 >= 1350 ~ "RANGE_10_CASES", 
      TRUE ~ NA_character_),
    
    # For 2024, categorize based on quantity ranges for cases
    COST_RANGE_CA_24 = case_when(
      QTD_DLV_CA_2024 >= 0 & QTD_DLV_CA_2024 < 150 ~ "RANGE_1_CASES",
      QTD_DLV_CA_2024 >= 150 & QTD_DLV_CA_2024 < 300 ~ "RANGE_2_CASES",
      QTD_DLV_CA_2024 >= 300 & QTD_DLV_CA_2024 < 450 ~ "RANGE_3_CASES",
      QTD_DLV_CA_2024 >= 450 & QTD_DLV_CA_2024 < 600 ~ "RANGE_4_CASES",
      QTD_DLV_CA_2024 >= 600 & QTD_DLV_CA_2024 < 750 ~ "RANGE_5_CASES",
      QTD_DLV_CA_2024 >= 750 & QTD_DLV_CA_2024 < 900 ~ "RANGE_6_CASES",
      QTD_DLV_CA_2024 >= 900 & QTD_DLV_CA_2024 < 1050 ~ "RANGE_7_CASES",
      QTD_DLV_CA_2024 >= 1050 & QTD_DLV_CA_2024 < 1200 ~ "RANGE_8_CASES",
      QTD_DLV_CA_2024 >= 1200 & QTD_DLV_CA_2024 < 1350 ~ "RANGE_9_CASES",
      QTD_DLV_CA_2024 >= 1350 ~ "RANGE_10_CASES",
      TRUE ~ NA_character_),
    
    # For 2023, categorize based on quantity ranges for gallons
    COST_RANGE_GAL_23 = case_when(
      QTD_DLV_GAL_2023 >= 0 & QTD_DLV_GAL_2023 < 150 ~ "RANGE_1_GALLONS", 
      QTD_DLV_GAL_2023 >= 150 & QTD_DLV_GAL_2023 < 300 ~ "RANGE_2_GALLONS",
      QTD_DLV_GAL_2023 >= 300 & QTD_DLV_GAL_2023 < 450 ~ "RANGE_3_GALLONS",
      QTD_DLV_GAL_2023 >= 450 & QTD_DLV_GAL_2023 < 600 ~ "RANGE_4_GALLONS",
      QTD_DLV_GAL_2023 >= 600 & QTD_DLV_GAL_2023 < 750 ~ "RANGE_5_GALLONS",
      QTD_DLV_GAL_2023 >= 750 & QTD_DLV_GAL_2023 < 900 ~ "RANGE_6_GALLONS",
      QTD_DLV_GAL_2023 >= 900 & QTD_DLV_GAL_2023 < 1050 ~ "RANGE_7_GALLONS",
      QTD_DLV_GAL_2023 >= 1050 & QTD_DLV_GAL_2023 < 1200 ~ "RANGE_8_GALLONS",
      QTD_DLV_GAL_2023 >= 1200 & QTD_DLV_GAL_2023 < 1350 ~ "RANGE_9_GALLONS",
      QTD_DLV_GAL_2023 >= 1350 ~ "RANGE_10_GALLONS", 
      TRUE ~ NA_character_),
    
    # For 2024, categorize based on quantity ranges for gallons
    COST_RANGE_GAL_24 = case_when(
      QTD_DLV_GAL_2024 >= 0 & QTD_DLV_GAL_2024 < 150 ~ "RANGE_1_GALLONS",
      QTD_DLV_GAL_2024 >= 150 & QTD_DLV_GAL_2024 < 300 ~ "RANGE_2_GALLONS",
      QTD_DLV_GAL_2024 >= 300 & QTD_DLV_GAL_2024 < 450 ~ "RANGE_3_GALLONS",
      QTD_DLV_GAL_2024 >= 450 & QTD_DLV_GAL_2024 < 600 ~ "RANGE_4_GALLONS",
      QTD_DLV_GAL_2024 >= 600 & QTD_DLV_GAL_2024 < 750 ~ "RANGE_5_GALLONS",
      QTD_DLV_GAL_2024 >= 750 & QTD_DLV_GAL_2024 < 900 ~ "RANGE_6_GALLONS",
      QTD_DLV_GAL_2024 >= 900 & QTD_DLV_GAL_2024 < 1050 ~ "RANGE_7_GALLONS",
      QTD_DLV_GAL_2024 >= 1050 & QTD_DLV_GAL_2024 < 1200 ~ "RANGE_8_GALLONS",
      QTD_DLV_GAL_2024 >= 1200 & QTD_DLV_GAL_2024 < 1350 ~ "RANGE_9_GALLONS",
      QTD_DLV_GAL_2024 >= 1350 ~ "RANGE_10_GALLONS",
      TRUE ~ NA_character_ ))


# First join for UNIT_COST_CA_23
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`), 
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", 
                   "COST_RANGE_CA_23" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_CA_23 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Second join for UNIT_COST_CA_24
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`), 
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", 
                   "COST_RANGE_CA_24" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_CA_24 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Third join for UNIT_COST_GAL_23
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`), 
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", 
                   "COST_RANGE_GAL_23" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_GAL_23 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Fourth join for UNIT_COST_GAL_24
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`), 
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", 
                   "COST_RANGE_GAL_24" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_GAL_24 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Calculating delivery costs for each year and drink type
full_data_customer <- full_data_customer %>%
  mutate(
    COST_CA_23 = QTD_DLV_CA_2023 * UNIT_COST_CA_23,
    COST_CA_24 = QTD_DLV_CA_2024 * UNIT_COST_CA_24,
    COST_GAL_23 = QTD_DLV_GAL_2023 * UNIT_COST_GAL_23,
    COST_GAL_24 = QTD_DLV_GAL_2024 * UNIT_COST_GAL_24 )

# Round unit costs and total costs to eight decimal places
full_data_customer <- full_data_customer %>%
  mutate(
    UNIT_COST_CA_23 = round(UNIT_COST_CA_23, 8),
    UNIT_COST_CA_24 = round(UNIT_COST_CA_24, 8),
    UNIT_COST_GAL_23 = round(UNIT_COST_GAL_23, 8),
    UNIT_COST_GAL_24 = round(UNIT_COST_GAL_24, 8),
    COST_CA_23 = round(COST_CA_23, 8),
    COST_CA_24 = round(COST_CA_24, 8),
    COST_GAL_23 = round(COST_GAL_23, 8),
    COST_GAL_24 = round(COST_GAL_24, 8))
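The four near-identical `case_when()` blocks above (and the four repeated unit-cost joins) could be collapsed with a small helper. Below is a minimal sketch assuming the same left-closed 150-unit breaks used above; `assign_range()` is a hypothetical helper, not part of the original pipeline:

```r
# Hypothetical helper: maps a quantity to the RANGE_*_CASES / RANGE_*_GALLONS
# labels used above, assuming the same 150-unit breaks with a 1350+ top bucket.
assign_range <- function(qty, unit = c("CASES", "GALLONS")) {
  unit <- match.arg(unit)
  breaks <- c(seq(0, 1350, by = 150), Inf)  # [0,150), [150,300), ..., [1350, Inf)
  idx <- cut(qty, breaks = breaks, right = FALSE, labels = FALSE)
  ifelse(is.na(idx), NA_character_, paste0("RANGE_", idx, "_", unit))
}

assign_range(c(0, 149, 150, 1350), "CASES")
# "RANGE_1_CASES" "RANGE_1_CASES" "RANGE_2_CASES" "RANGE_10_CASES"
```

With such a helper, each COST_RANGE_* column would reduce to a one-line mutate (e.g. `COST_RANGE_CA_23 = assign_range(QTD_DLV_CA_2023, "CASES")`), and the four unit-cost joins could similarly be written as a loop over the four range columns.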

The table below presents the information that constitutes the calculation of the delivery cost per customer.

Code
# Costs table
summary_table <- as.data.table(full_data_customer)[, .(
  CUSTOMER_NUMBER, COLD_DRINK_CHANNEL,
  QTD_DLV_CA_2023 = round(QTD_DLV_CA_2023, 0), 
  QTD_DLV_CA_2024 = round(QTD_DLV_CA_2024, 0), 
  QTD_DLV_GAL_2023 = round(QTD_DLV_GAL_2023, 0), 
  QTD_DLV_GAL_2024 = round(QTD_DLV_GAL_2024, 0),
  COST_RANGE_CA_23, COST_RANGE_CA_24, COST_RANGE_GAL_23, COST_RANGE_GAL_24,
  UNIT_COST_CA_23 = round(UNIT_COST_CA_23, 2), 
  UNIT_COST_CA_24 = round(UNIT_COST_CA_24, 2), 
  UNIT_COST_GAL_23 = round(UNIT_COST_GAL_23, 2), 
  UNIT_COST_GAL_24 = round(UNIT_COST_GAL_24, 2),
  COST_CA_23 = round(COST_CA_23, 2), 
  COST_CA_24 = round(COST_CA_24, 2), 
  COST_GAL_23 = round(COST_GAL_23, 2), 
  COST_GAL_24 = round(COST_GAL_24, 2))]

# Display the table interactively
datatable(summary_table, options = list(
pageLength = 5, scrollX = TRUE, scrollY = TRUE))

All costs are calculated correctly. Percentage variations for the number of operations, demand, and costs have not yet been generated, because not all customers have history for both 2023 and 2024, which prevents such calculations across the full base. Methods to quantify each customer's growth will be identified later.

3.5 Target Variables: Initial Assumptions

As initially explained, we will establish classifications related to the target variables to create an initial reference point.

3.5.1 - Demand Threshold and Fleet Assignment

The average annual consumption per customer will be calculated, and customers will be classified according to whether they exceed the threshold of 400 units (cases plus gallons).

Code
# Calculating the average
full_data_customer$AVG_ANNUAL_CONSUMP <- round((full_data_customer$QTD_DLV_CA_GAL_2023 + full_data_customer$QTD_DLV_CA_GAL_2024) / 2, 1)

# Creating the THRESHOLD_REACH variable
full_data_customer$THRESHOLD_REACH <- ifelse(full_data_customer$AVG_ANNUAL_CONSUMP < 400, 0, 1)

# Summarize data by THRESHOLD_REACH
data_threshold_reach <- full_data_customer %>%
  group_by(THRESHOLD_REACH) %>%
  summarise(CustomerCount = n(), .groups = 'drop') %>%
  mutate(Percentage = round(CustomerCount / sum(CustomerCount) * 100, 1))  # Calculate percentage

# Display the table
kable(data_threshold_reach, col.names = c("Threshold Reach", "Customer Count", "Percentage (%)"), format = "simple")
Threshold Reach Customer Count Percentage (%)
0 23081 76.1
1 7239 23.9

In total, 23,081 customers (76.1%) did not reach the average annual threshold of 400 units (cases plus gallons), while the remaining 7,239 (23.9%) did.

Customers who exceed 400 units annually will be assigned to Red Trucks, while the remaining customers will be allocated to White Trucks.

Code
# Create the FLEET_TYPE column based on THRESHOLD_REACH only
full_data_customer$FLEET_TYPE <- ifelse(full_data_customer$THRESHOLD_REACH == 1, 
                                         "RED TRUCK", 
                                         "WHITE TRUCK")

# Group and calculate the number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_customers = n(),
    .groups = "drop"
  )

# Calculate percentage of customers within each LOCAL_FOUNT_ONLY group separately
summary_fleet_type <- summary_fleet_type %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    pct_customers = total_customers / sum(total_customers) * 100  # Calculate the percentage within each LOCAL_FOUNT_ONLY group
  )

# Transform data into long format for percentages
summary_fleet_type_long <- summary_fleet_type %>%
  pivot_longer(
    cols = starts_with("pct_"),
    names_to = "metric",
    values_to = "percentage"
  ) %>%
  mutate(
    metric = factor(metric, 
                    levels = c("pct_customers"),
                    labels = c("Percentage of Customers"))
  )

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors
summary_fleet_type_long$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_long$FLEET_TYPE <- factor(summary_fleet_type_long$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot for percentages with FLEET_TYPE as colors and LOCAL_FOUNT_ONLY as groups
ggplot(summary_fleet_type_long, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers by Fleet Type and Local Fountain Only") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),  # Remove legend title
    legend.position = "right",  # Position legend on the right side
    legend.box = "vertical",  # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

Code
# Group and calculate the number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_customers = n(),
    .groups = "drop"
  )

# Display the summary with the count of customers by fleet type and LOCAL_FOUNT_ONLY
#summary_fleet_type_count

According to these criteria, 13% of Local Fountain Only customers would be assigned to RED TRUCK. Among the other customers, 24% would receive deliveries via RED TRUCK.

Code
# Group by LOCAL_FOUNT_ONLY and FLEET_TYPE, then calculate the total delivered volume (QTD_DLV_TOTAL)
summary_fleet_type_total <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),  # Sum of QTD_DLV_TOTAL for each FLEET_TYPE and LOCAL_FOUNT_ONLY
    .groups = "drop"
  )

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for better plotting
summary_fleet_type_total$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_total$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_total$FLEET_TYPE <- factor(summary_fleet_type_total$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_total, aes(x = LOCAL_FOUNT_ONLY, y = total_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(total_QTD_DLV)), 
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Total Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),  # Remove legend title
    legend.position = "right",  # Position legend on the right side
    legend.box = "vertical",  # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

Code
# Group and calculate the total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),
    .groups = "drop"
  )

# Display the summary with the total delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
#summary_fleet_type_count

The vast majority of the volume would be delivered by RED TRUCK (85% of the total), with the remaining portion delivered by WHITE TRUCK (15%).

Code
# Group by LOCAL_FOUNT_ONLY and FLEET_TYPE and calculate total delivered volume (QTD_DLV_TOTAL)
summary_fleet_type_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),  # Sum of QTD_DLV_TOTAL for each FLEET_TYPE and LOCAL_FOUNT_ONLY
    .groups = "drop"
  ) %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100  # Calculate the percentage of delivered volume per LOCAL_FOUNT_ONLY group
  )

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for better plotting
summary_fleet_type_pct$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_pct$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_pct$FLEET_TYPE <- factor(summary_fleet_type_pct$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_pct, aes(x = LOCAL_FOUNT_ONLY, y = pct_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_QTD_DLV, 1), "%")), 
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Percentage of Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),  # Remove legend title
    legend.position = "right",  # Position legend on the right side
    legend.box = "vertical",  # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

Code
# Group and calculate the percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100  # Percentage of delivered volume within each group
  )

# Display the summary with the percentage of delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
#summary_fleet_type_count_pct

Considering the customer groups independently, nearly 59% of the volume delivered to local partners purchasing fountain only would be transported by RED TRUCKS, while for the remaining customers, almost 85% of the volume would be delivered by RED TRUCKS.

3.6 - Questions and Considerations on Missing Data and Unknown Classes

After the first portion of the EDA, there is a better understanding of the data, but not all questions have been answered. These will continue to be explored in the next section, though some may remain unresolved due to the nature of the questions. The following questions have been identified:

  • Based on the available data, what would be a robust statistical approach to calculate the customer growth rate? A simplistic approach was initially used, relying on the average as a reference to visualize the data. However, a more validated method could certainly be applied.

  • What is the average load capacity of a Red Truck compared to a White Truck?

  • Adding an ID for individual account executives to the customer profile data could be valuable. Is the quality of the account executive a confounding variable when looking at high growth rate customers?

  • Does the company set a delivery deadline in days or hours?
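On the first question above, one candidate is the symmetric (mid-point) growth rate, which stays bounded even when the 2023 baseline is small or zero. The sketch below illustrates the idea only; it is not a method validated against the project data:

```r
# Symmetric growth rate: (x2 - x1) / mean(x1, x2), bounded in [-2, 2].
# Unlike the plain percentage change, it is defined when the 2023 baseline is 0.
symmetric_growth <- function(x_2023, x_2024) {
  denom <- (x_2023 + x_2024) / 2
  ifelse(denom == 0, NA_real_, (x_2024 - x_2023) / denom)
}

symmetric_growth(c(100, 0, 200), c(150, 50, 100))
# 0.4000000  2.0000000 -0.6666667
```

A customer with volume in only one of the two years then receives the boundary value of +2 or -2 rather than an infinite percentage, which makes cross-customer comparisons feasible.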

4. Exploratory Data Analysis (EDA) - Part II

After completing the initial analysis and building the datasets, we will now explore more detailed customer information, keeping the stated objectives in focus.

4.1 Customers Overview

Geographical Distribution of Customers

Although the location data is not real, its distribution can be observed below.

Code
# Load the U.S. map
us_map <- map_data("state")

# Create the plot
ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightblue", color = "white") +
  geom_point(data = full_data_customer, aes(x = LONGITUDE, y = LATITUDE),
             color = "#B33951", alpha = 0.6, size = 0.5) +
  coord_fixed(1.3) +
  theme_minimal() +
  labs(title = "Customers Geographical Distribution") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Code
# Calculate the number of unique customers
unique_customers <- length(unique(full_data_customer$CUSTOMER_NUMBER))

# Display the frequency of each value for the 'CHAIN_MEMBER' column
chain_member_count <- table(full_data_customer$CHAIN_MEMBER)

# Calculate the number of unique primary group numbers
unique_primary_groups <- length(unique(full_data_customer$PRIMARY_GROUP_NUMBER))

# Sum the costs for cases and gallons in 2023 and 2024
cost_dlv <- sum(full_data_customer$COST_CA_23, full_data_customer$COST_CA_24, 
                full_data_customer$COST_GAL_23, full_data_customer$COST_GAL_24, 
                na.rm = TRUE)

# Summing the number of transactions for cases and gallons in 2023 and 2024
trans_dlv <- sum(full_data_customer$NUM_TRANS_DLV_CA_23, full_data_customer$NUM_TRANS_DLV_CA_24, 
                 full_data_customer$NUM_TRANS_DLV_GAL_23, full_data_customer$NUM_TRANS_DLV_GAL_24, 
                 na.rm = TRUE)

# Summing the quantity delivered of cases and gallons in 2023 and 2024
qtd_dlv <- sum(
  full_data_customer$QTD_DLV_CA_2023, full_data_customer$QTD_DLV_GAL_2023,
  full_data_customer$QTD_DLV_CA_2024, full_data_customer$QTD_DLV_GAL_2024,
  na.rm = TRUE
)

# Average cost per delivery transaction
avg_cost_per_transaction <- cost_dlv / trans_dlv

# Average cost per case or gallon delivered
avg_cost_per_quantity <- cost_dlv / qtd_dlv

# Display results
unique_customers  # Number of unique customers
chain_member_count  # Frequency count of each chain member
unique_primary_groups  # Number of unique primary group numbers
cost_dlv  # Total cost for cases and gallons in 2023 and 2024
trans_dlv  # Total number of transactions for cases and gallons in 2023 and 2024
avg_cost_per_transaction  # Average cost per delivery transaction
avg_cost_per_quantity  # Average cost per case or gallon delivered

After removing customers who made no transactions in 2023 and 2024, there are 30,320 unique customers who transacted during these years. Of these, 18,061 are unique outlets, while 12,259 belong to 1,020 different chains that have transacted with the company.

All of their delivery transactions represented a total cost of approximately $67,907,394, with an average of $55.80 per delivery transaction and $1.88 per case or gallon delivered.

4.2 Local Market Partners (Fountain Only)

Code
# Clean
clean_data <- full_data %>%
  filter(!is.na(LOCAL_FOUNT_ONLY)) %>%  # Filtering data where LOCAL_FOUNT_ONLY is not NA
  mutate(LOCAL_FOUNT_ONLY = factor(LOCAL_FOUNT_ONLY, levels = c("0", "1")))  # Converting to factor

# Aggregate data by LOCAL_FOUNT_ONLY and create the plot
summary_data <- clean_data %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    customers = n_distinct(CUSTOMER_NUMBER),
    transactions = n(),
    qtd_cas = sum(DELIVERED_CASES, na.rm = TRUE),
    qtd_gal = sum(DELIVERED_GALLONS, na.rm = TRUE),
    total_qtd = qtd_cas + qtd_gal,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cust = customers / sum(customers) * 100,
    pct_trans = transactions / sum(transactions) * 100,
    pct_qtd = total_qtd / sum(total_qtd) * 100,
    pct_gal = qtd_gal / sum(qtd_gal) * 100
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data to long format
summary_data_long <- summary_data %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(
    metric = factor(metric, 
                    levels = c("pct_cust", "pct_trans", "pct_gal", "pct_qtd"),
                    labels = c("Customers", "Delivery Transactions", "Gallons", "Total (Cases+Gallons)"))
  )

# Convert LFO to factor
summary_data_long$LFO <- factor(summary_data_long$LFO, levels = c("0", "1"))

# Create the plot
ggplot(summary_data_long, aes(x = LFO, y = percentage, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", ncol = 2) +
  labs(title = "Percentage Breakdown by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))

Local market partners who purchase only fountain drinks (gallons) account for 4.5% of customers and represent 6% of the company's gallon demand. Their delivery transactions are few, contributing only 3% of the total, and their delivered volume accounts for just 1.6% of the total negotiated volume.

Code
# Group and calculate sums and percentages
summary_full_data <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    total_cost_gal = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    total_cost_ca = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    total_cost_all = total_cost_gal + total_cost_ca,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cost_gal = total_cost_gal / total_cost_all * 100,
    pct_cost_ca = total_cost_ca / total_cost_all * 100,
    pct_total = total_cost_all / sum(total_cost_all) * 100
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data into long format for totals
summary_full_data_long <- summary_full_data %>%
  pivot_longer(
    cols = starts_with("total_"), 
    names_to = "metric", 
    values_to = "value"
  ) %>%
  mutate(
    metric = factor(metric, 
                    levels = c("total_cost_gal", "total_cost_ca", "total_cost_all"),
                    labels = c("Cost Gallons (23 & 24)", "Cost Cases (23 & 24)", "Total Cost"))
  )

# For percentages
summary_full_data_pct_long <- summary_full_data %>%
  pivot_longer(
    cols = starts_with("pct_"), 
    names_to = "metric", 
    values_to = "percentage"
  ) %>%
  mutate(
    metric = factor(metric, 
                    levels = c("pct_cost_gal", "pct_cost_ca", "pct_total"),
                    labels = c("Percentage Cost Gallons (23 & 24)", "Percentage Cost Cases (23 & 24)", "Percentage Total Cost"))
  )

# Ensure LFO is a factor
summary_full_data_long$LFO <- factor(summary_full_data_long$LFO, levels = c("0", "1"))

# Plot for total costs
ggplot(summary_full_data_long, aes(x = LFO, y = value, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(value, prefix = "$")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", nrow = 1) + 
  labs(title = "Total Costs by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))

In the years 2023 and 2024, the total delivery cost was $67.9 million, of which only $1.2 million was allocated to local market partners.

Code
# Group and calculate sums and percentages by LFO
summary_full_data <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    total_cost_gal = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    total_cost_ca = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    total_cost_all = total_cost_gal + total_cost_ca,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cost_gal = total_cost_gal / sum(total_cost_gal) * 100,  
    pct_cost_ca = total_cost_ca / sum(total_cost_ca) * 100,      
    pct_total = total_cost_all / sum(total_cost_all) * 100        
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data into long format for percentages
summary_full_data_pct_long <- summary_full_data %>%
  pivot_longer(
    cols = starts_with("pct_"), 
    names_to = "metric", 
    values_to = "percentage"
  ) %>%
  mutate(
    metric = factor(metric, 
                    levels = c("pct_cost_gal", "pct_cost_ca", "pct_total"),
                    labels = c("% Cost - Gallons", "% Cost - Cases", "% Total Cost"))
  )

# Ensure LFO is a factor
summary_full_data_pct_long$LFO <- factor(summary_full_data_pct_long$LFO, levels = c("0", "1"))

# Plot for percentages
ggplot(summary_full_data_pct_long, aes(x = LFO, y = percentage, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", nrow = 1) +  
  labs(title = "Percentage Costs by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))

Thus, in 2023 and 2024, the local partners who consume only fountain accounted for 1.8% of the total delivery costs. When we look at their share specifically in gallon deliveries, their participation rises to 7.3%.

4.3 Customer History

The chart below shows the density of customers by partnership start date and first delivery date.

Code
# Gather the data for ON_BOARDING_DATE and FIRST_DELIVERY_DATE, filtering out 2025 data
profile_data_long <- profile_data %>%
  filter(!is.na(ON_BOARDING_DATE) & !is.na(FIRST_DELIVERY_DATE)) %>%
  # Filter out 2025 data to avoid showing deliveries in that year
  filter(format(FIRST_DELIVERY_DATE, "%Y") != "2025") %>%
  pivot_longer(cols = c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), 
               names_to = "Event", values_to = "Date") %>%
  # Set factor levels to ensure ON_BOARDING_DATE appears first in the plot
  mutate(Event = factor(Event, levels = c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE")))

# Create density plots with facet_wrap
ggplot(profile_data_long, aes(x = Date, fill = Event, color = Event)) +
  geom_density(alpha = 0.5) +  # Adjust transparency for better visualization
  facet_wrap(~ Event, scales = "free", ncol = 2) +  # Create facets for each variable
  labs(title = "Density Plots of Onboarding and First Delivery Dates",
       x = "Date",
       y = "Density") +
  scale_fill_manual(values = c("steelblue", "orange")) +  # Set custom colors (first delivery = orange)
  scale_color_manual(values = c("steelblue", "orange")) +
  scale_y_continuous(labels = scales::label_number()) +  # Remove scientific notation on Y axis
  theme_minimal() +
  theme(legend.position = "none")  # Remove the legend for a cleaner plot

The vast majority of customers started to appear after 2010. The figures for the first deliveries show that, since 2016, at least 2,000 customers have received their first delivery each year. There were peaks in 2016 and 2017. In 2024, there was a decrease in the number of customers receiving their first delivery compared to 2023.

Code
# Reshape data: Gather ON_BOARDING_DATE and FIRST_DELIVERY_DATE
profile_data_long <- profile_data %>%
  filter(!is.na(ON_BOARDING_DATE) & !is.na(FIRST_DELIVERY_DATE)) %>%
  pivot_longer(cols = c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), 
               names_to = "Event", values_to = "Date")

# Set factor levels to ensure ON_BOARDING_DATE appears first in the plot
profile_data_long$Event <- factor(profile_data_long$Event, levels = c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE"))

# Ensure Date is in Date format
profile_data_long$Date <- as.Date(profile_data_long$Date)

# Create histograms with yearly aggregation
ggplot(profile_data_long, aes(x = Date, fill = Event)) +
  geom_histogram(binwidth = 365, color = "black", alpha = 0.5, position = "identity") +  
  facet_wrap(~ Event, scales = "free_x", ncol = 2) +  # Free scaling for X axis
  labs(title = "Distribution of Customer Onboarding and First Delivery Dates",
       x = "Date",
       y = "Count") +
  scale_fill_manual(values = c("steelblue", "orange")) +  # Custom colors for events
  scale_x_date(labels = scales::date_format("%Y"), expand = c(0.01, 0.01)) +  # Show only year on X-axis
  scale_y_continuous(labels = scales::label_number()) +  # Ensure the y-axis is not in scientific notation
  theme_minimal() +
  theme(legend.position = "none",  # Remove legend
        axis.text.x = element_text(hjust = -0.1),  
        axis.ticks.x = element_blank(),  
        panel.grid.major.x = element_blank(),  # Remove vertical gridlines
        panel.grid.minor.x = element_blank())  # Remove minor vertical gridlines

4.4 Order Types

The way orders are placed and by whom is important for understanding customer growth potential. Most customer profiles are associated with sales representatives (65.7%). Other methods follow with 17.6%, and MyCoke 360 accounts for nearly 8%, despite only being launched in Summer 2024 to replace MyCoke Legacy.

However, when analyzing actual transactions from 2023 and 2024, the distribution of order types differs significantly from the customer profiles. For example, sales representatives were responsible for only 27.5% of the orders, not 65.7% as listed in the profiles. Therefore, the analysis will be based on actual transactions rather than profile data.
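The gap between the two views can be reproduced with a minimal example. The customer IDs, channel labels, and counts below are invented; the point is only that profile shares and transaction shares are computed over different denominators (customers vs. orders).

```r
# Each customer declares one order method in the profile...
profiles <- data.frame(
  CUSTOMER           = c("A", "B", "C", "D"),
  PROFILE_ORDER_TYPE = c("SALES REP", "SALES REP", "SALES REP", "OTHER")
)

# ...but actual orders may flow through other channels
orders <- data.frame(
  CUSTOMER   = c("A", "A", "B", "C", "D", "D"),
  ORDER_TYPE = c("MYCOKE", "CALL CENTER", "SALES REP", "MYCOKE",
                 "MYCOKE", "SALES REP")
)

# Profile view: 3 of 4 customers (75%) sit under sales reps
profile_share <- mean(profiles$PROFILE_ORDER_TYPE == "SALES REP") * 100

# Transaction view: only 2 of 6 orders (~33%) actually came from sales reps
txn_share <- mean(orders$ORDER_TYPE == "SALES REP") * 100
```

This is why the analysis relies on the transaction-level shares rather than the declared profile attribute.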

Below are the percentages of cases ordered by order type, across all transactions placed in 2023 and 2024.

Code
# Define the custom color palette (Neutral colors from RColorBrewer's "Set3")
custom_palette_type <- brewer.pal(6, "Set3")  # A 6-color palette from Set3

# Summarize data by ORDER_TYPE, summing ORDERED_CASES
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(OrderedCasesSum = sum(ORDERED_CASES, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Percentage = round(OrderedCasesSum / sum(OrderedCasesSum) * 100, 1),
         Percentage = ifelse(Percentage < 0.15, NA, Percentage))  # Set values less than 0.15% to NA (not displayed)

# Create the horizontal bar chart with percentages, now with no aggregation by LOCAL_FOUNT_ONLY
ggplot(data_summary_order_type, aes(x = OrderedCasesSum, y = reorder(ORDER_TYPE, OrderedCasesSum), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")),  # Only display text if Percentage is not NA
            position = position_stack(vjust = 0.5), 
            hjust = -0.01, 
            color = "black", size = 3.2) +
  labs(title = "Percentage of Ordered Case Volumes by Order Type (23 & 24)",
       x = NULL, 
       y = NULL) +  
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = custom_palette_type) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),  
        legend.position = "none",  # Remove the legend, as we don't need it anymore
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

Among the ordered-case transactions of 2023 and 2024, the largest share of operations was carried out through digital channels, MyCoke Legacy and MyCoke 360 combined, at 35.5%. Sales representatives followed with 25.4%, and call centers with 15.5%. MyCoke 360, the recently launched platform, alone makes up 7.1% of the transactions.

Code
# Summarize data by ORDER_TYPE and LOCAL_FOUNT_ONLY, summing ORDERED_GALLONS
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE, LOCAL_FOUNT_ONLY) %>%
  summarise(OrderedGallonsSum = sum(ORDERED_GALLONS, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Percentage = round(OrderedGallonsSum / sum(OrderedGallonsSum) * 100, 1),
         Percentage = ifelse(Percentage < 0.0, NA, Percentage))  # Threshold of 0 here, so no labels are hidden

# Create the horizontal bar chart with percentages, facet by LOCAL_FOUNT_ONLY
ggplot(data_summary_order_type, aes(x = OrderedGallonsSum, y = reorder(ORDER_TYPE, OrderedGallonsSum), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")),  # Only display text if Percentage is not NA
            position = position_stack(vjust = 0.5), 
            hjust = -0.01, 
            color = "black", size = 3.2) +
  labs(title = "% of Ordered Gallons by Order Type and Customer type (2023 & 2024)",
       x = NULL, 
       y = NULL) +  
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = custom_palette_type) +  # Apply the custom color palette for ORDER_TYPE
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),  
        legend.position = "none",  # Hide the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank(),
        strip.text = element_text(face = "bold", size = 10),  
        strip.background = element_blank()) +  
  facet_wrap(~ LOCAL_FOUNT_ONLY, scales = "free_y", ncol = 2, labeller = labeller(LOCAL_FOUNT_ONLY = c('0' = 'Others', '1' = 'Local Fountain Only')))  # Facet labels

For gallon orders, only a very small fraction (less than 6%) is represented by Local Market Partners that order Fountain Only. For these customers, the majority of their orders are placed via the call center (2.4%), followed by digital channels (2.2%), and finally sales reps (1.3%).

For the remaining customers, digital channels represent 34.6% (MyCoke360 + Legacy), sales reps 32.5%, and call centers 24.5%.

It can be said that digital channels are the most used, accounting for approximately 35% of the total volume of cases and gallons for all customers. Sales reps have a smaller proportional share for case orders but carry more weight for gallon orders.
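The comparison above amounts to two normalized share tables, one per package type. A sketch with toy volumes (scaled so the shares match the report's figures; "OTHER" is an invented aggregate of the remaining channels):

```r
library(dplyr)

# Toy volumes per channel and package type, scaled to sum to 1,000 each
orders <- data.frame(
  ORDER_TYPE = rep(c("DIGITAL", "SALES REP", "CALL CENTER", "OTHER"), times = 2),
  PACKAGE    = rep(c("CASES", "GALLONS"), each = 4),
  VOLUME     = c(355, 254, 155, 236,   # cases
                 346, 325, 245, 84)    # gallons
)

# Share of each channel within each package type
shares <- orders %>%
  group_by(PACKAGE) %>%
  mutate(SHARE = round(VOLUME / sum(VOLUME) * 100, 1)) %>%
  ungroup()

# Sales reps: 25.4% of case volume but 32.5% of gallon volume
subset(shares, ORDER_TYPE == "SALES REP")$SHARE
```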

Code
# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONS
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(
    DeliveredCasesSum = sum(DELIVERED_CASES, na.rm = TRUE),
    DeliveredGallonsSum = sum(DELIVERED_GALLONS, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,
    Percentage = round(TotalVolume / sum(TotalVolume) * 100, 1)
  )

# Create horizontal bar chart with both absolute volume and percentage
ggplot(data_summary_order_type, aes(x = TotalVolume, y = reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(
    aes(label = paste(scales::comma(TotalVolume, accuracy = 1), paste0("(", Percentage, "%)"))),
    position = position_stack(vjust = 0.5),
    hjust = -0.01,
    color = "black",
    size = 3.2
  ) +
  scale_x_continuous(
    labels = scales::comma,
    breaks = seq(0, max(data_summary_order_type$TotalVolume), by = 5000000),
    expand = expansion(c(0, 0.05))
  ) +
  scale_fill_manual(values = custom_palette_type) +
  labs(
    title = "Total Delivered Cases and Gallons by Order Type (23 & 24)",
    x = "Volume (units)",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10, face = "plain"),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.major.x = element_line(color = "lightgray", linewidth = 0.5),
    panel.grid.minor = element_blank()
  )

In line with the previous points, digital channels account for nearly 36% of the total volume delivered in 2023 and 2024, followed by sales reps at 27.5% and call centers at 18.5%.

Code
# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONS
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(
    DeliveredCasesSum = sum(DELIVERED_CASES, na.rm = TRUE),
    DeliveredGallonsSum = sum(DELIVERED_GALLONS, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,
    Percentage = round(TotalVolume / sum(TotalVolume) * 100, 1)
  )

# Create horizontal bar chart with absolute volume and percentage
ggplot(data_summary_order_type, aes(x = TotalVolume, y = reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(
    aes(label = paste(scales::comma(TotalVolume, accuracy = 1), paste0("(", Percentage, "%)"))),
    position = position_stack(vjust = 0.5),
    hjust = -0.01,
    color = "black",
    size = 3.2
  ) +
  scale_x_continuous(
    labels = scales::comma,
    breaks = seq(0, max(data_summary_order_type$TotalVolume), by = 5000000),
    expand = expansion(c(0, 0.05))
  ) +
  scale_fill_manual(values = custom_palette_type) +
  labs(
    title = "Total Delivery Cost by Order Type (23 & 24)",
    x = "Cost $",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10, face = "plain"),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.major.x = element_line(color = "lightgray", linewidth = 0.5),
    panel.grid.minor = element_blank()
  )

Digital channels account for the majority of the costs, representing 40% of the total delivered cost. Notably, call center costs are slightly higher than sales rep costs even though call centers move less volume, which points to a higher delivery cost per unit for that channel.
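A quick way to check this reading is to divide total cost by total delivered volume per order type. The figures below are toy values, not the report's real cost columns; only the resulting ranking matters.

```r
library(dplyr)

# Toy totals per order type (invented values)
summary_toy <- data.frame(
  ORDER_TYPE = c("DIGITAL", "SALES REP", "CALL CENTER"),
  TOTAL_COST = c(360, 180, 200),    # delivery cost ($)
  TOTAL_VOL  = c(3600, 2750, 1850)  # delivered cases + gallons
)

# Rank channels by cost per delivered unit
unit_cost <- summary_toy %>%
  mutate(COST_PER_UNIT = TOTAL_COST / TOTAL_VOL) %>%
  arrange(desc(COST_PER_UNIT))

# The call center's total cost is close to the sales reps' in absolute terms,
# but it is spread over far less volume, so its unit cost comes out highest
unit_cost$ORDER_TYPE[1]  # "CALL CENTER"
```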

Code
# Summarize by ORDER_TYPE and FLEET_TYPE using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(FLEET_TYPE), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, FLEET_TYPE) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE), .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE, levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order, aes(x = TotalDelivered, y = ORDER_TYPE, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +  
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), 
            hjust = 0, 
            color = "black", size = 3.2) +
  labs(title = "400-Gallon Threshold vs. Delivered Volume by Order Type", 
       x = "Volume (units)", 
       y = NULL, 
       fill = "Fleet Type") +  
  scale_x_continuous(
    labels = function(x) paste0(x / 1e6, "M"),
    breaks = c(2500000, 5000000, 7500000, 10000000),
    expand = expansion(c(0, 0.05))
  ) +  
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  
  theme_minimal() +  
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10, face = "plain"),
    legend.position = "right",
    legend.direction = "vertical",
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "lightgray", linewidth = 0.5),
    panel.grid.minor = element_blank()
  )

Sales Rep had the highest share of delivered volume (62%) that would be served by red trucks if the 400-gallon threshold were applied. Call Center, by contrast, showed the highest share that would be served by white trucks.
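The threshold rule itself reduces to a one-line classification on annual gallon volume. A minimal sketch follows; the customer numbers and volumes are invented, and treating the 400-gallon boundary as inclusive is an assumption of this sketch.

```r
library(dplyr)

# Toy annual volumes per customer
customers <- data.frame(
  CUSTOMER_NUMBER = 1:4,
  ANNUAL_GALLONS  = c(120, 400, 650, 2300)
)

# At or above 400 gallons per year -> internal fleet (RED TRUCK);
# below the threshold -> ARTM (WHITE TRUCK)
customers <- customers %>%
  mutate(FLEET_TYPE = ifelse(ANNUAL_GALLONS >= 400, "RED TRUCK", "WHITE TRUCK"))

table(customers$FLEET_TYPE)
```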

4.5 Channel Types

More than 50% of transactions were made through the DINING channel, followed by GOODS (16.6%), EVENTS (9.2%), and BULK TRADE (8.4%). The remaining channels each represent less than 5% of the total.

Transactions from Local Market Partners buying fountain only are almost entirely concentrated in DINING: they account for 2.7% of all transactions, versus 47.8% for DINING transactions from other customers.

Code
# Calculate the frequency of each COLD_DRINK_CHANNEL
data_summary_cold_drink_channel <- full_data %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))


# Create a horizontal bar chart with percentages for COLD_DRINK_CHANNEL
ggplot(data_summary_cold_drink_channel, aes(x = Count, y = reorder(COLD_DRINK_CHANNEL, Count), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")),  # Only display text if Percentage is not NA
            position = position_stack(vjust = 0.5), 
            hjust = -0.01, 
            color = "black", size = 3.2) +
  labs(title = "Percentage of Transactions by Cold Drink Channel",
       x = NULL, 
       y = NULL) +  
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = cold_drink_channel_colors) +  # Use your custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),  
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons and cases
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) + 
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100, 1))  # Calculate the percentage

# Create a horizontal bar chart for the percentage of total volume by cold drink channel
ggplot(data_summary, aes(x = Total_Volume / 1e6, y = reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Total Volume (Gallons and Cases) by Cold Drink Channel",
       x = "Quantity in Millions", 
       y = NULL) +  
  scale_x_continuous(
    labels = function(x) paste0(x, "M"),
    breaks = seq(0, 10, by = 2.5),
    expand = expansion(c(0, 0.05))
  ) +
  geom_vline(xintercept = c(2.5, 5, 7.5, 10), color = "lightgray", linetype = "solid", linewidth = 0.3) +
  scale_fill_manual(values = cold_drink_channel_colors) +  
  theme_minimal() +  
  theme(
    plot.title = element_text(size = 10, face = "bold"),  
    axis.text.y = element_text(size = 10),  
    axis.text.x = element_text(size = 10),  
    axis.title.x = element_text(size = 10, face = "bold"),
    legend.position = "none",  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank()
  )

Dining was the segment with the highest total consumption, accounting for 27% of the total, followed by Bulk Trade with 25.8% and Workplace with 13.4%. The following section analyzes the information separately by packaging type (cases and gallons) and customer type.

Code
# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE excluding "CONVENTIONAL"
data_summary <- full_data_customer %>%
  filter(COLD_DRINK_CHANNEL != "CONVENTIONAL") %>%  # Exclude "CONVENTIONAL" channel
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) + 
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # Calculate the percentage of each Fleet Type within each Cold Drink Channel
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = Total_Volume / sum(Total_Volume) * 100) %>%
  ungroup()

# Create the horizontal bar plot
ggplot(data_summary, aes(x = Total_Volume, y = reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +  
  geom_text(aes(label = paste0(round(Percentage), "%")), position = position_stack(vjust = 0.5), 
            hjust = 0.4, color = "black", size = 3.2) +  # Round percentages and remove decimal places
  labs(title = "400 gallons Threshold - Total Volume by Cold Drink Channel",
       x = "Total Volume (in Millions)", 
       y = NULL) +  
  scale_x_continuous(labels = scales::comma_format(scale = 1e-6, suffix = "M"),  # Convert axis to millions
                     breaks = seq(2500000, 10000000, by = 2500000)) +  # Define custom x-axis breaks
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_text(size = 10),  # X-axis title size
        axis.text.x = element_text(size = 10),  # X-axis text size
        legend.position = "bottom",  # Position legend below the plot
        legend.box = "horizontal",  # Display legend items horizontally
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank()) +
  # Add vertical lines at specific breaks on the x-axis
  geom_vline(xintercept = c(2500000, 5000000, 7500000, 10000000), color = "gray", linetype = "solid", linewidth = 0.5)

Code
# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) + 
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # Calculate the percentage of each Fleet Type within each Cold Drink Channel
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = Total_Volume / sum(Total_Volume) * 100) %>%
  ungroup()

# Create the table
#kable(data_summary, format = "markdown", digits = 1, caption = "Total Volume and Percentage by Cold Drink Channel and Fleet Type")

Above are the shares of volume that would be served by red and white trucks under the 400-gallon threshold; red trucks would handle the majority. The CONVENTIONAL segment is not displayed because its volume is so low that its bars would overlap the labels; within that segment, the split is 53% red trucks to 47% white trucks.

4.5.1 Cold Drink Channel - Delivered Cases for All Customers

Below are the percentages of cases delivered in 2023 and 2024 for all customers by cold drink channel.

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of cases (QTD_DLV_CA_2023 and QTD_DLV_CA_2024)
data_summary_cases <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases = sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Cases / sum(Total_Cases) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total cases by cold drink channel
ggplot(data_summary_cases, aes(x = Total_Cases, y = reorder(COLD_DRINK_CHANNEL, Total_Cases), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Cases (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Cases", 
       y = NULL) +  
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
 scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),   # Remove the x-axis title
        axis.text.x = element_blank(),    # Remove the x-axis text
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for cases (COST_CA_23 and COST_CA_24)
data_summary_cases_cost <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases_Cost = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Cases_Cost / sum(Total_Cases_Cost) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total cases cost by cold drink channel
ggplot(data_summary_cases_cost, aes(x = Total_Cases_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Cases_Cost), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Cases Delivery Cost (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Cases Cost", 
       y = NULL) +  
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
 scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),   # Remove the x-axis title
        axis.text.x = element_blank(),    # Remove the x-axis text
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

The main segment receiving cases (bottles, cans, etc.) was Bulk Trade with 33%, followed by Workplace with 17%, and Dining with 14.6%. On the other hand, the segment that presented the highest delivery costs for cases was Dining, accounting for 34% of the cost in 2023 and 2024, followed by Goods at 21%, and Bulk Trade at 16%.

The tables below aim to provide detailed information within this group regarding the number of customers, costs, quartile divisions, and other relevant factors.

Code
# Calculate Total Cases, COST_CA, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases = sum(QTD_DLV_CA_2023 + QTD_DLV_CA_2024, na.rm = TRUE),
    COST_CA = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    # Count only customers where Total_Cases > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_CA_2023 + QTD_DLV_CA_2024 > 0]), 
    # Calculate the total cases per customer, excluding customers with zero total cases
    Total_Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),
    Total_Cost_Per_Customer = list(COST_CA_23 + COST_CA_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average COST_CA per Total_Cases
    AVG_Cost_CA = COST_CA / Total_Cases,  
    # Calculate the percentage of total cases
    PERCT_CASE = round(Total_Cases / sum(Total_Cases) * 100, 1),
    # Calculate the percentage of total customers
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1), # Calculate percentage of customers
    # Calculate the average cases per customer (without decimals)
    AVG_Qtd = round(Total_Cases / N_Customers),  # No decimals for AVG_Qtd
    # Calculate the median of cases per customer, excluding customers with zero cases
    Median_Qtd = sapply(Total_Cases_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)  # Only consider positive cases for the median
    }),
    # Calculate the median cost per case for each cold drink channel, excluding customers with zero cases
    Median_Cost = sapply(1:length(Total_Cases_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_cases <- Total_Cases_Per_Customer[[i]]
      median(total_cost[total_cases > 0] / total_cases[total_cases > 0], na.rm = TRUE)  # Median cost per case
    })
  ) %>%
  # Order by Total Cases in descending order (before formatting)
  arrange(desc(Total_Cases)) %>%
  # Calculate Opt_Cost for each COLD_DRINK_CHANNEL based on minimum median delivery cost for CASES
  left_join(
    cost_data %>%
      filter(grepl("CASES", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only 1 line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format COST_CA, Total_Cases, AVG_Cost_CA, N_Customers, Perct_Customers, AVG_Qtd, Median.Qtd, and Median.Cost after ordering
  mutate(
    COST_CA = scales::comma(COST_CA),  
    Total_Cases = scales::comma(Total_Cases),  
    AVG_Cost_CA = scales::comma(AVG_Cost_CA, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers), # Format N_Customers
    PERCT_CASE = sprintf("%.1f", PERCT_CASE),  # Ensure 1 decimal place for percentage
    Perct_Customers = sprintf("%.1f", Perct_Customers),  # Ensure 1 decimal place for percentage
    AVG_Qtd = scales::comma(AVG_Qtd),  # Format AVG_Qtd
    Median_Qtd = scales::comma(Median_Qtd),  # Format Median_Qtd
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),  # Format Median_Cost
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)  # Format Opt_Cost
  ) %>%
  # Select columns in the correct order with exact column names
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Cases, PERCT_CASE, COST_CA, N_Customers, Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_CA, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL,
    `T.Cases` = Total_Cases,
    `Cases %` = PERCT_CASE,
    `T.Cost $` = COST_CA,
    `N.Cust` = N_Customers,
    `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd,
    `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_CA,
    `Med.Cost.Cust $` = Median_Cost,
    `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "T.Cases", "Cases %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%
  add_header_above(c("CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers

| Channel | T.Cases | Cases % | T.Cost $ | N.Cust | P.Cust % | Avg.Qtd.Cust | Median.Qtd.Cust | Avg.Cost.Cust $ | Med.Cost.Cust $ | Opt.Cost $ |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| BULK TRADE | 8,687,959 | 32.9 | 8,127,990 | 1,278 | 5.3 | 6,798 | 1,239 | 0.94 | 3.53 | 0.73 |
| WORKPLACE | 4,567,596 | 17.3 | 2,299,625 | 712 | 2.9 | 6,415 | 164 | 0.50 | 8.06 | 0.37 |
| DINING | 3,859,778 | 14.6 | 17,429,159 | 10,929 | 45.2 | 353 | 82 | 4.52 | 8.59 | 2.05 |
| GOODS | 3,494,064 | 13.2 | 10,780,042 | 5,542 | 22.9 | 630 | 205 | 3.09 | 7.33 | 1.09 |
| EVENT | 2,796,241 | 10.6 | 5,677,840 | 2,785 | 11.5 | 1,004 | 230 | 2.03 | 5.59 | 1.17 |
| PUBLIC SECTOR | 1,422,915 | 5.4 | 2,805,085 | 1,411 | 5.8 | 1,008 | 244 | 1.97 | 4.97 | 1.07 |
| WELLNESS | 903,700 | 3.4 | 1,529,502 | 340 | 1.4 | 2,658 | 532 | 1.69 | 4.61 | 1.23 |
| ACCOMMODATION | 695,490 | 2.6 | 2,507,698 | 1,150 | 4.8 | 605 | 326 | 3.61 | 5.54 | 1.48 |
| CONVENTIONAL | 6,337 | 0.0 | 73,937 | 53 | 0.2 | 120 | 64 | 11.67 | 14.20 | 5.34 |
Code
############

# Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total cases per customer, excluding zero values
    Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average and median cases per customer
    `Avg.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) median(x[x > 0])),
    # Compute quartiles for quantity
    `1Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Ensure calculations are row-wise based on quartile values
  mutate(
    # Extract case values from the list
    Case_Values = list(unlist(Cases_Per_Customer)),
    # Calculate total cases volume per quartile using the correct conditions
    `1Quart.Vol` = sum(Case_Values[which(Case_Values > 0 & Case_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Case_Values[which(Case_Values > `1Quart.Qtd` & Case_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Case_Values[which(Case_Values > `2Quart.Qtd` & Case_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Case_Values[which(Case_Values > `3Quart.Qtd`)]),
    # Calculate the total volume for the quartiles (1 to 4) in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Calculate percentages based on the sum of volumes from all quartiles for each channel
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  # Order by Avg.Qtd.Cust in descending order
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    `Avg.Qtd.Cust` = scales::comma(`Avg.Qtd.Cust`, accuracy = 1),
    `Median.Qtd.Cust` = scales::comma(`Median.Qtd.Cust`, accuracy = 1),
    `1Quart.Qtd` = scales::comma(`1Quart.Qtd`, accuracy = 1),
    `2Quart.Qtd` = scales::comma(`2Quart.Qtd`, accuracy = 1),
    `3Quart.Qtd` = scales::comma(`3Quart.Qtd`, accuracy = 1),
    `1Quart.Vol` = scales::comma(`1Quart.Vol`, accuracy = 1),
    `2Quart.Vol` = scales::comma(`2Quart.Vol`, accuracy = 1),
    `3Quart.Vol` = scales::comma(`3Quart.Vol`, accuracy = 1),
    `4Quart.Vol` = scales::comma(`4Quart.Vol`, accuracy = 1),
    `1Q.Vol%` = formatC(`1Q.Vol%`, format = "f", digits = 1),
    `2Q.Vol%` = formatC(`2Q.Vol%`, format = "f", digits = 1),
    `3Q.Vol%` = formatC(`3Q.Vol%`, format = "f", digits = 1),
    `4Q.Vol%` = formatC(`4Q.Vol%`, format = "f", digits = 1)
  ) %>%
  # Select only required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, 
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  # Rename columns
  rename(
    `Channel` = COLD_DRINK_CHANNEL
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%
  add_header_above(c("CASES (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
CASES (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers
Channel Avg.Qtd.Cust Median.Qtd.Cust 1Quart.Qtd 2Quart.Qtd 3Quart.Qtd 1Quart.Vol 1Q.Vol% 2Quart.Vol 2Q.Vol% 3Quart.Vol 3Q.Vol% 4Quart.Vol 4Q.Vol%
BULK TRADE 6,798 1,239 384 1,239 4,162 53,070 0.6 236,734 2.7 743,904 8.6 7,654,251 88.1
WORKPLACE 6,415 164 40 164 546 2,622 0.1 16,174 0.4 54,740 1.2 4,494,060 98.4
WELLNESS 2,658 532 101 532 2,400 3,224 0.4 22,891 2.5 99,518 11.0 778,067 86.1
PUBLIC SECTOR 1,008 244 68 244 760 10,971 0.8 48,995 3.4 162,280 11.4 1,200,669 84.4
EVENT 1,004 230 56 230 753 17,174 0.6 89,181 3.2 296,563 10.6 2,393,323 85.6
GOODS 630 205 98 205 466 68,717 2.0 206,577 5.9 414,603 11.9 2,804,166 80.3
ACCOMMODATION 605 326 99 326 668 12,725 1.8 57,122 8.2 138,511 19.9 487,132 70.0
DINING 353 82 16 82 318 17,073 0.4 116,307 3.0 479,779 12.4 3,246,620 84.1
CONVENTIONAL 120 64 26 64 138 228 3.6 583 9.2 1,196 18.9 4,330 68.3

The tables above support several additional analyses that are not discussed here. It is worth highlighting that the bulk trade channel contains a large number of high-volume outliers, which inflate its average annual volume (6,798 cases) to roughly five times its median (1,239 cases). The same skew carries over into its delivery costs.
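To make the skew concrete, here is a minimal base-R sketch on toy data (hypothetical volumes, not client data) showing how a few very large accounts inflate the mean while leaving the median untouched, the same pattern behind the bulk trade gap:

```r
# Toy per-customer annual volumes (hypothetical, not client data):
# many small accounts plus two very large outlier accounts.
volumes <- c(rep(500, 20), rep(1200, 10), 50000, 80000)

mean(volumes)    # 4750 -- pulled up by the two outliers
median(volumes)  # 500  -- unaffected by them
```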

4.5.2 Cold Drink Channel - Delivered Gallons for All Customers

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons (QTD_DLV_GAL_2023 and QTD_DLV_GAL_2024)
data_summary_gallons <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons / sum(Total_Gallons) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons by cold drink channel
ggplot(data_summary_gallons, aes(x = Total_Gallons, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Gallons (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons", 
       y = NULL) +  
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),   # Remove the x-axis title
        axis.text.x = element_blank(),    # Remove the x-axis text
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for gallons (COST_GAL_23 and COST_GAL_24)
data_summary_gallons_cost <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons_Cost = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons_Cost / sum(Total_Gallons_Cost) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons cost by cold drink channel
ggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons Cost", 
       y = NULL) +  
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),   # Remove the x-axis title
        axis.text.x = element_blank(),    # Remove the x-axis text
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

For gallons, dining is the largest channel, accounting for 61% of the volume delivered in 2023 and 2024 and 73% of the gallon delivery cost. It is followed by events at 18.7% of volume (10% of cost) and bulk trade at 6.5% (3% of cost).

The tables below provide detailed information for this group on customer counts, costs, quartile divisions, and other relevant factors.
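The table code below stores each channel's per-customer totals in a dplyr list column and then summarises them with `sapply`. A minimal sketch of that pattern on toy data (hypothetical channel names and volumes):

```r
library(dplyr)

# Toy data (hypothetical): per-customer gallon totals in two channels.
toy <- tibble(channel = c("A", "A", "A", "B", "B"),
              gal     = c(0, 100, 300, 50, 150))

toy %>%
  group_by(channel) %>%
  summarise(per_cust = list(gal), .groups = "drop") %>%   # one list entry per channel
  mutate(median_pos = sapply(per_cust, function(x) median(x[x > 0])))
# channel A: median of 100, 300 -> 200; channel B: median of 50, 150 -> 100
```

Keeping the raw per-customer vectors in a list column lets several statistics (median, quartiles, cost ratios) be computed from the same grouped data without re-joining.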

Code
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]), 
    # Keep each customer's total gallons as a list column (zeros are filtered out later)
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average COST_GAL per Total_Gallons
    AVG_Cost_GAL = COST_GAL / Total_Gallons,  
    # Calculate the percentage of total gallons
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),
    # Calculate the percentage of total customers
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1), # Calculate percentage of customers
    # Calculate the average gallons per customer (without decimals)
    AVG_Qtd = round(Total_Gallons / N_Customers),  # No decimals for AVG_Qtd
    # Calculate the median of gallons per customer, excluding customers with zero gallons
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)  # Only consider positive gallons for the median
    }),
    # Calculate the median cost per gallon for each cold drink channel, excluding customers with zero gallons
    Median_Cost = sapply(1:length(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)  # Median cost per gallon
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Calculate Opt_Cost for each COLD_DRINK_CHANNEL based on minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only 1 line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format COST_GAL, Total_Gallons, AVG_Cost_GAL, N_Customers, Perct_Customers, AVG_Qtd, Median.Qtd, and Median.Cost after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),  
    Total_Gallons = scales::comma(Total_Gallons),  
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers), # Format N_Customers
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),  # Ensure 1 decimal place for percentage
    Perct_Customers = sprintf("%.1f", Perct_Customers),  # Ensure 1 decimal place for percentage
    AVG_Qtd = scales::comma(AVG_Qtd),  # Format AVG_Qtd
    Median_Qtd = scales::comma(Median_Qtd),  # Format Median_Qtd
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),  # Format Median_Cost
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)  # Format Opt_Cost
  ) %>%
  # Select columns in the correct order with exact column names
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers, Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL,
    `T.Gallons` = Total_Gallons,
    `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL,
    `N.Cust` = N_Customers,
    `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd,
    `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL,
    `Med.Cost.Cust $` = Median_Cost,
    `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%
  add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers
Channel T.Gallons Gallons % T.Cost $ N.Cust P.Cust % Avg.Qtd.Cust Median.Qtd.Cust Avg.Cost.Cust $ Med.Cost.Cust $ Opt.Cost $
DINING 5,881,701 60.9 12,164,673 11,267 71.3 522 235.0 2.07 3.49 0.82
EVENT 1,802,976 18.7 1,711,570 1,473 9.3 1,224 287.5 0.95 2.95 0.39
BULK TRADE 631,817 6.5 532,474 464 2.9 1,362 347.5 0.84 2.84 0.39
PUBLIC SECTOR 460,586 4.8 711,092 635 4.0 725 210.0 1.54 3.73 0.69
WORKPLACE 258,877 2.7 436,789 682 4.3 380 177.5 1.69 3.15 0.41
WELLNESS 252,103 2.6 380,152 311 2.0 811 460.0 1.51 2.91 0.44
ACCOMMODATION 202,090 2.1 346,403 465 2.9 435 150.0 1.71 3.91 0.42
GOODS 165,540 1.7 417,493 490 3.1 338 182.5 2.52 4.62 0.69
CONVENTIONAL 4,502 0.0 43,144 15 0.1 300 135.0 9.58 19.77 0.72
Code
############

# Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total gallons per customer (zeros are filtered out in later steps)
    Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average and median gallons per customer
    `Avg.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) median(x[x > 0])),
    # Compute quartiles for quantity
    `1Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Ensure calculations are row-wise based on quartile values
  mutate(
    # Extract gallon values from the list
    Gallon_Values = list(unlist(Gallons_Per_Customer)),
    # Calculate total gallons volume per quartile using the correct conditions
    `1Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > 0 & Gallon_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `1Quart.Qtd` & Gallon_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `2Quart.Qtd` & Gallon_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `3Quart.Qtd`)]),
    # Calculate the total volume for the quartiles (1 to 4) in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Calculate percentages based on the sum of volumes from all quartiles for each channel
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  # Order by Avg.Qtd.Cust in descending order
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    `Avg.Qtd.Cust` = scales::comma(`Avg.Qtd.Cust`, accuracy = 1),
    `Median.Qtd.Cust` = scales::comma(`Median.Qtd.Cust`, accuracy = 1),
    `1Quart.Qtd` = scales::comma(`1Quart.Qtd`, accuracy = 1),
    `2Quart.Qtd` = scales::comma(`2Quart.Qtd`, accuracy = 1),
    `3Quart.Qtd` = scales::comma(`3Quart.Qtd`, accuracy = 1),
    `1Quart.Vol` = scales::comma(`1Quart.Vol`, accuracy = 1),
    `2Quart.Vol` = scales::comma(`2Quart.Vol`, accuracy = 1),
    `3Quart.Vol` = scales::comma(`3Quart.Vol`, accuracy = 1),
    `4Quart.Vol` = scales::comma(`4Quart.Vol`, accuracy = 1),
    `1Q.Vol%` = formatC(`1Q.Vol%`, format = "f", digits = 1),
    `2Q.Vol%` = formatC(`2Q.Vol%`, format = "f", digits = 1),
    `3Q.Vol%` = formatC(`3Q.Vol%`, format = "f", digits = 1),
    `4Q.Vol%` = formatC(`4Q.Vol%`, format = "f", digits = 1)
  ) %>%
  # Select only required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, 
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  # Rename columns
  rename(
    `Channel` = COLD_DRINK_CHANNEL
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%
  add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers
Channel Avg.Qtd.Cust Median.Qtd.Cust 1Quart.Qtd 2Quart.Qtd 3Quart.Qtd 1Quart.Vol 1Q.Vol% 2Quart.Vol 2Q.Vol% 3Quart.Vol 3Q.Vol% 4Quart.Vol 4Q.Vol%
BULK TRADE 1,362 348 114 348 1,057 4,870 0.8 25,440 4.0 71,161 11.3 530,345 83.9
EVENT 1,224 288 110 288 800 20,007 1.1 69,634 3.9 182,784 10.1 1,530,550 84.9
WELLNESS 811 460 158 460 919 5,927 2.4 20,742 8.2 52,658 20.9 172,777 68.5
PUBLIC SECTOR 725 210 79 210 554 6,269 1.4 22,497 4.9 54,900 11.9 376,920 81.8
DINING 522 235 88 235 585 121,015 2.1 429,705 7.3 1,082,575 18.4 4,248,406 72.2
ACCOMMODATION 435 150 55 150 518 3,231 1.6 11,618 5.7 35,067 17.4 152,175 75.3
WORKPLACE 380 177 85 177 370 8,121 3.1 21,469 8.3 44,312 17.1 184,975 71.5
GOODS 338 182 82 182 359 5,141 3.1 16,529 10.0 30,573 18.5 113,297 68.4
CONVENTIONAL 300 135 105 135 182 390 8.7 350 7.8 700 15.5 3,062 68.0

The tables above support several additional analyses that are not discussed here. It is worth highlighting that the dining channel averages 522 gallons per customer with a median of 235, a much smaller mean-to-median gap than the one observed for cases in bulk trade, so outliers distort its delivery costs far less.
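The quartile tables above are built by cutting each channel's customers at the Q1/Q2/Q3 volume thresholds and summing the volume that falls in each band. A minimal base-R sketch of that logic on toy data (hypothetical volumes, not client data):

```r
# Toy per-customer volumes (hypothetical, not client data)
x <- c(10, 20, 30, 40, 50, 60, 70, 80)

q <- quantile(x, c(0.25, 0.50, 0.75))            # quartile cut points
band <- cut(x, breaks = c(0, q, Inf), labels = paste0("Q", 1:4))
vol_by_band <- tapply(x, band, sum)              # volume landing in each band
round(100 * vol_by_band / sum(vol_by_band), 1)   # each band's share of total volume
```

The same half-open intervals (`> lower` and `<= upper`) are what the `which()` conditions in the chunk above implement band by band.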

Code
# Calculate the mean and median for the "DINING" channel, without creating a permanent column
mean_value <- mean(
  (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024)[
    full_data_customer$COLD_DRINK_CHANNEL == "DINING" & 
    (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024) > 0
  ], na.rm = TRUE
)

median_value <- median(
  (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024)[
    full_data_customer$COLD_DRINK_CHANNEL == "DINING" & 
    (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024) > 0
  ], na.rm = TRUE
)

# Filter data for the "DINING" channel, exclude zero sums, and plot the histogram
full_data_customer %>%
  filter(COLD_DRINK_CHANNEL == "DINING" & 
         (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) > 0) %>%  # Filter for "DINING" and total_gallons > 0
  mutate(total_gallons = QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) %>%  # Temporarily create 'total_gallons'
  ggplot(aes(x = total_gallons)) +
  # Dark gray bars with no border
  geom_histogram(binwidth = 0.5, fill = "darkgray", color = "darkgray", alpha = 0.7) +  
  # Line for mean
  geom_vline(aes(xintercept = mean_value, color = "Mean"), 
             linetype = "solid", size = 0.6) +  # Line for mean
  # Line for median
  geom_vline(aes(xintercept = median_value, color = "Median"), 
             linetype = "solid", size = 0.6) +  # Line for median
  # Customize colors and legend position
  scale_color_manual(values = c("Mean" = "blue", "Median" = "coral"),
                     labels = c(paste("Mean:", round(mean_value, 0)),
                                paste("Median:", round(median_value, 0)))) +
  labs(
    title = "Total Gallons Delivered for Dining Channel",
    subtitle = "(Limited to a Maximum of 5000)",
    x = "Total Gallons Delivered",
    y = "Number of Customers"
  ) +
  xlim(0, 5000) +  # Limit x-axis to 5000
  theme_minimal() +  # Use a minimal theme
  theme(
    panel.grid.major.y = element_line(color = "gray", size = 0.5),  # Add horizontal grid lines
    panel.grid = element_blank(),  # Remove vertical grid lines
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12, face = "italic"),  # Style the subtitle
    legend.position = "right",  # Move the legend to the right
    legend.title = element_blank(),  # Remove the title from the legend
    legend.key = element_blank()  # Remove the background of the legend
  )

4.5.3 Cold Drink Channel - Delivered Gallons for Local Market Partners Fountain Only

Code
# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons (QTD_DLV_GAL_2023 and QTD_DLV_GAL_2024), and filter by LOCAL_FOUNT_ONLY == 1
data_summary_gallons <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons / sum(Total_Gallons) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons by cold drink channel
ggplot(data_summary_gallons, aes(x = Total_Gallons, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Local Fountain Only - Percentage of Gallons (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons", 
       y = NULL) +  
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +  
  theme(plot.title = element_text(size = 10, face = "bold")) +  
  theme(axis.text.y = element_text(size = 10),  
        axis.title.x = element_blank(),   # Remove the x-axis title
        axis.text.x = element_blank(),    # Remove the x-axis text
        legend.position = "none",  # Remove the legend
        panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank())

Code
# # Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for gallons (COST_GAL_23 and COST_GAL_24), 
# # and filter by LOCAL_FOUNT_ONLY == 1
# data_summary_gallons_cost <- full_data_customer %>%
#   filter(LOCAL_FOUNT_ONLY == 1) %>%
#   group_by(COLD_DRINK_CHANNEL) %>%
#   summarise(
#     Total_Gallons_Cost = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
#     .groups = 'drop'
#   ) %>%
#   mutate(Percentage = round(Total_Gallons_Cost / sum(Total_Gallons_Cost) * 100, 1))  # Calculate the percentage

# # Create a bar chart for the percentage of total gallons cost by cold drink channel for LOCAL_FOUNT_ONLY
# ggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost), fill = COLD_DRINK_CHANNEL)) +
#   geom_bar(stat = "identity", position = "stack", alpha = 0.5) +  
#   geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5), 
#             hjust = -0.01, color = "black", size = 3.2) +
#   labs(title = "Local Fountain Only - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",
#        x = "Percentage of Total Gallons Cost", 
#        y = NULL) +  
#   scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +  
#   scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
#   theme_minimal() +  
#   theme(plot.title = element_text(size = 10, face = "bold")) +  
#   theme(axis.text.y = element_text(size = 10),  
#         axis.title.x = element_blank(),   # Remove the x-axis title
#         axis.text.x = element_blank(),    # Remove the x-axis text
#         legend.position = "none",  # Remove the legend
#         panel.grid.major = element_blank(),  
#         panel.grid.minor = element_blank())

Among the local fountain-only customers, nearly 90% of demand comes from the dining channel, followed by event at 4.5% and workplace at 3.5%. Costs follow nearly the same proportions and are therefore not displayed (the corresponding code is kept commented out above).

The tables below provide detailed information for this group on customer counts, costs, quartile divisions, and other relevant factors.

Code
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%  
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]), 
    # Keep each customer's total gallons as a list column (zeros are filtered out later)
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average COST_GAL per Total_Gallons
    AVG_Cost_GAL = COST_GAL / Total_Gallons,  
    # Calculate the percentage of total gallons
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),
    # Calculate the percentage of total customers
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1), # Calculate percentage of customers
    # Calculate the average gallons per customer (without decimals)
    AVG_Qtd = round(Total_Gallons / N_Customers),  # No decimals for AVG_Qtd
    # Calculate the median of gallons per customer, excluding customers with zero gallons
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)  # Only consider positive gallons for the median
    }),
    # Calculate the median cost per gallon for each cold drink channel, excluding customers with zero gallons
    Median_Cost = sapply(1:length(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)  # Median cost per gallon
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Calculate Opt_Cost for each COLD_DRINK_CHANNEL based on minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only 1 line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format COST_GAL, Total_Gallons, AVG_Cost_GAL, N_Customers, Perct_Customers, AVG_Qtd, Median.Qtd, and Median.Cost after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),  
    Total_Gallons = scales::comma(Total_Gallons),  
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers), # Format N_Customers
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),  # Ensure 1 decimal place for percentage
    Perct_Customers = sprintf("%.1f", Perct_Customers),  # Ensure 1 decimal place for percentage
    AVG_Qtd = scales::comma(AVG_Qtd),  # Format AVG_Qtd
    Median_Qtd = scales::comma(Median_Qtd),  # Format Median_Qtd
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),  # Format Median_Cost
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)  # Format Opt_Cost
  ) %>%
  # Select columns in the correct order with exact column names
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers, Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL,
    `T.Gallons` = Total_Gallons,
    `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL,
    `N.Cust` = N_Customers,
    `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd,
    `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL,
    `Med.Cost.Cust $` = Median_Cost,
    `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "darkorange") %>%
  add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - Local Fountain Only" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Deliveries by Cold Drink Channel - Local Fountain Only
Channel T.Gallons Gallons % T.Cost $ N.Cust P.Cust % Avg.Qtd.Cust Median.Qtd.Cust Avg.Cost.Cust $ Med.Cost.Cust $ Opt.Cost $
DINING 510,335 89.0 1,091,574 1,150 84.6 444 177.5 2.14 3.98 0.82
EVENT 25,774 4.5 50,622 68 5.0 379 121.2 1.96 3.76 0.39
WORKPLACE 20,029 3.5 30,279 63 4.6 318 116.6 1.51 3.15 0.41
WELLNESS 4,855 0.8 9,603 9 0.7 539 230.0 1.98 3.34 0.44
GOODS 4,759 0.8 15,484 25 1.8 190 106.7 3.25 4.62 0.69
PUBLIC SECTOR 3,655 0.6 12,938 26 1.9 141 81.2 3.54 3.73 0.69
ACCOMMODATION 3,135 0.5 8,055 13 1.0 241 102.5 2.57 3.91 0.42
BULK TRADE 772 0.1 1,902 5 0.4 154 125.0 2.46 2.85 0.39
Code
#######

# Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>% 
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total gallons per customer (zeros are filtered out in later steps)
    Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Calculate the average and median gallons per customer
    `Avg.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) median(x[x > 0])),
    # Compute quartiles for quantity
    `1Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Ensure calculations are row-wise based on quartile values
  mutate(
    # Extract gallon values from the list
    Gallon_Values = list(unlist(Gallons_Per_Customer)),
    # Calculate total gallons volume per quartile using the correct conditions
    `1Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > 0 & Gallon_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `1Quart.Qtd` & Gallon_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `2Quart.Qtd` & Gallon_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `3Quart.Qtd`)]),
    # Calculate the total volume for the quartiles (1 to 4) in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Calculate percentages based on the sum of volumes from all quartiles for each channel
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  # Order by Avg.Qtd.Cust in descending order
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    `Avg.Qtd.Cust` = scales::comma(`Avg.Qtd.Cust`, accuracy = 1),
    `Median.Qtd.Cust` = scales::comma(`Median.Qtd.Cust`, accuracy = 1),
    `1Quart.Qtd` = scales::comma(`1Quart.Qtd`, accuracy = 1),
    `2Quart.Qtd` = scales::comma(`2Quart.Qtd`, accuracy = 1),
    `3Quart.Qtd` = scales::comma(`3Quart.Qtd`, accuracy = 1),
    `1Quart.Vol` = scales::comma(`1Quart.Vol`, accuracy = 1),
    `2Quart.Vol` = scales::comma(`2Quart.Vol`, accuracy = 1),
    `3Quart.Vol` = scales::comma(`3Quart.Vol`, accuracy = 1),
    `4Quart.Vol` = scales::comma(`4Quart.Vol`, accuracy = 1),
    `1Q.Vol%` = formatC(`1Q.Vol%`, format = "f", digits = 1),
    `2Q.Vol%` = formatC(`2Q.Vol%`, format = "f", digits = 1),
    `3Q.Vol%` = formatC(`3Q.Vol%`, format = "f", digits = 1),
    `4Q.Vol%` = formatC(`4Q.Vol%`, format = "f", digits = 1)
  ) %>%
  # Select only required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, 
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  # Rename columns
  rename(
    `Channel` = COLD_DRINK_CHANNEL
  ) %>%
  kable("html", escape = FALSE, align = "c", col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "darkorange") %>%
  add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - Local Fountain Only" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - Local Fountain Only
Channel Avg.Qtd.Cust Median.Qtd.Cust 1Quart.Qtd 2Quart.Qtd 3Quart.Qtd 1Quart.Vol 1Q.Vol% 2Quart.Vol 2Q.Vol% 3Quart.Vol 3Q.Vol% 4Quart.Vol 4Q.Vol%
WELLNESS 539 230 100 230 998 142 2.9 360 7.4 1,362 28.1 2,990 61.6
DINING 444 178 58 178 481 8,472 1.7 30,716 6.0 87,949 17.2 383,197 75.1
EVENT 379 121 39 121 466 358 1.4 1,282 5.0 5,242 20.3 18,891 73.3
WORKPLACE 318 117 55 117 232 449 2.2 1,357 6.8 2,331 11.6 15,892 79.3
ACCOMMODATION 241 102 85 102 323 255 8.1 198 6.3 833 26.6 1,850 59.0
GOODS 190 107 30 107 222 152 3.2 454 9.5 860 18.1 3,292 69.2
BULK TRADE 154 125 82 125 235 87 11.3 125 16.2 235 30.5 325 42.1
PUBLIC SECTOR 141 81 38 81 195 198 5.4 325 8.9 782 21.4 2,350 64.3

The tables above support a range of analyses that will not be discussed here. It is worth highlighting that among the local market partners (fountain only), average consumption was 444 gallons and the median 177, yielding an average cost of $2.14 per gallon, nearly half the $3.98 per gallon observed across all customers.
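The cost-per-gallon figures cited above come from dividing total delivery cost by total gallons delivered. A minimal sketch of that calculation, using hypothetical column names and synthetic values (the actual cost fields live in `cost_data` and may be named differently):

```r
library(dplyr)

# Hypothetical structure; the real cost columns in cost_data may differ
cost_demo <- tibble(
  CUSTOMER_NUMBER   = c(1, 2),
  delivery_cost     = c(950, 400),  # total delivery cost, 2023-24 (synthetic)
  gallons_delivered = c(240, 185)   # total gallons delivered (synthetic)
)

# Blended cost per gallon across the group
cost_per_gallon <- with(cost_demo, sum(delivery_cost) / sum(gallons_delivered))
```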

4.6 Trade Channel

Code
# Summarize data by TRADE_CHANNEL, summing the quantities of gallons and cases
data_summary_trade_channel <- full_data_customer %>%
  group_by(TRADE_CHANNEL) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100, 1))

# Define a dynamic color palette to handle more than 9 categories
num_colors <- length(unique(data_summary_trade_channel$TRADE_CHANNEL))
custom_palette <- setNames(colorRampPalette(brewer.pal(9, "Set2"))(num_colors),
                           unique(data_summary_trade_channel$TRADE_CHANNEL))

# Create a horizontal bar chart for the percentage of total volume by trade channel
ggplot(data_summary_trade_channel, aes(x = Total_Volume / 1e6, y = reorder(TRADE_CHANNEL, Total_Volume), fill = TRADE_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5),
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Total Volume (Gallons and Cases) by Trade Channel",
       x = "Quantity in Millions",
       y = NULL) +
  scale_x_continuous(
    limits = c(0, 7.5),
    breaks = c(2.5, 5, 7.5),
    labels = function(x) paste0(x, "M"),
    expand = expansion(c(0, 0.05))
  ) +
  geom_vline(xintercept = c(2.5, 5, 7.5), color = "lightgray", linetype = "solid", linewidth = 0.3) +
  scale_fill_manual(values = custom_palette) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 10, face = "bold"),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Among the trade channels, Fast Casual Dining (19%), Comprehensive Dining (13.4%), and Travel (12%) rank among the top five in total volume demand. They are also the only segments that individually account for more than 10% of total volume.

4.7 Sub Trade Channel

The sub trade channel variable comprises 48 classes, so we created a searchable reference table instead of a chart.

Code
# Create a summary table for the frequency of each unique value in SUB_TRADE_CHANNEL
data_summary_sub_trade_channel <- profile_data %>%
  group_by(SUB_TRADE_CHANNEL) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))  

# Display the interactive table with DT
datatable(data_summary_sub_trade_channel, 
          extensions = 'Buttons',  # required for the export buttons below to render
          options = list(pageLength = 5, autoWidth = TRUE, dom = 'Bfrtip', 
                         buttons = c('copy', 'csv', 'excel', 'pdf')))

4.8 CO2 Customers

Code
# Calculate percentages for CO2_CUSTOMER
co2_customer_summary <- profile_data %>%
  group_by(CO2_CUSTOMER) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))

# Create the plot
ggplot(co2_customer_summary, aes(x = CO2_CUSTOMER, y = Percentage, fill = as.factor(CO2_CUSTOMER))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage Breakdown by CO2 Customers Status") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6"), 
                    labels = c("Non-CO2 Customers", "CO2 Customers")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.position = "right",  # Position the legend to the right
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(size = 10),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Non-CO2 Customers", "1" = "CO2 Customers"))

Around 61% of customers do not purchase CO2, a group that includes all local market partners. Still, the remaining share of CO2 consumers is sizable, at nearly 39%.

4.9 Transactions by Cases

Code
# Create the summary table with adjusted minimum value and median considering values > 0
summary_cases <- data.frame(
  type = c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES", "RETURNED_CASES"),
  
  # Calculating the minimum value considering values > 0 and rounding to four decimal places
  min = c(
    round(min(op_data$ORDERED_CASES[op_data$ORDERED_CASES > 0]), 4),
    round(min(op_data$LOADED_CASES[op_data$LOADED_CASES > 0]), 4),
    round(min(op_data$DELIVERED_CASES[op_data$DELIVERED_CASES > 0]), 4),
    round(min(op_data$RETURNED_CASES[op_data$RETURNED_CASES > 0]), 4)
  ),
  
  # Median calculation considering only values greater than 0
  median = c(
    median(op_data$ORDERED_CASES[op_data$ORDERED_CASES > 0]),
    median(op_data$LOADED_CASES[op_data$LOADED_CASES > 0]),
    median(op_data$DELIVERED_CASES[op_data$DELIVERED_CASES > 0]),
    median(op_data$RETURNED_CASES[op_data$RETURNED_CASES > 0])
  ),
  
  # Maximum without decimal places
  max = c(
    floor(max(op_data$ORDERED_CASES)),
    floor(max(op_data$LOADED_CASES)),
    floor(max(op_data$DELIVERED_CASES)),
    floor(max(op_data$RETURNED_CASES))
  ),
  
  # Sum with thousands separator
  sum_qtd = c(
    format(sum(op_data$ORDERED_CASES), big.mark = ","),
    format(sum(op_data$LOADED_CASES), big.mark = ","),
    format(sum(op_data$DELIVERED_CASES), big.mark = ","),
    format(sum(op_data$RETURNED_CASES), big.mark = ",")
  ),
  
  # Number of transactions with thousands separator
  num_trans = c(
    format(sum(op_data$ORDERED_CASES > 0), big.mark = ","),
    format(sum(op_data$LOADED_CASES > 0), big.mark = ","),
    format(sum(op_data$DELIVERED_CASES > 0), big.mark = ","),
    format(sum(op_data$RETURNED_CASES > 0), big.mark = ",")
  ),
  
  # Average quantity per transaction without decimals
  avg_qtd_by_trans = c(
    round(sum(op_data$ORDERED_CASES) / max(1, sum(op_data$ORDERED_CASES > 0))),
    round(sum(op_data$LOADED_CASES) / max(1, sum(op_data$LOADED_CASES > 0))),
    round(sum(op_data$DELIVERED_CASES) / max(1, sum(op_data$DELIVERED_CASES > 0))),
    round(sum(op_data$RETURNED_CASES) / max(1, sum(op_data$RETURNED_CASES > 0)))
  )
)

# Create the table using kableExtra for better formatting
summary_cases %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%  # Light blue header
  add_header_above(c("CASES - Statistics by transactions greater than 0" = 7)) %>%
  kable_paper("striped", full_width = F)
CASES - Statistics by transactions greater than 0
type min median max sum_qtd num_trans avg_qtd_by_trans
ORDERED_CASES 0.0898 11.5 8479 28,074,470 772,877 36
LOADED_CASES 0.0898 11.0 8171 27,103,098 770,624 35
DELIVERED_CASES 0.0001 11.0 8069 26,434,079 750,872 35
RETURNED_CASES 0.0390 8.0 3132 156,165 2,582 60

Considering all case transactions, the table above summarizes the key metrics. The values for ORDERED CASES, LOADED CASES, and DELIVERED CASES are similar, as expected. Some records involve quantities of less than one unit, the maximum values exceed 8,000 cases, and the average per transaction is approximately 35 cases.

The number of RETURNED CASES transactions is far smaller (2,582), although the largest single return reached 3,132 cases. The average return is 60 cases per transaction.
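The column-by-column construction of the summary table above can also be written as a single long-format pipeline, which avoids repeating each statistic four times. A sketch on synthetic data (the real analysis uses the corresponding columns of `op_data`):

```r
library(dplyr)
library(tidyr)

# Synthetic stand-in for a few op_data columns
op_demo <- tibble(
  ORDERED_CASES   = c(0, 12, 30, 5),
  DELIVERED_CASES = c(0, 12, 28, 5),
  RETURNED_CASES  = c(0, 0, 2, 0)
)

# One summary row per measure, restricted to transactions > 0
stats_demo <- op_demo %>%
  pivot_longer(everything(), names_to = "type", values_to = "qty") %>%
  filter(qty > 0) %>%
  group_by(type) %>%
  summarise(
    min = min(qty), median = median(qty), max = max(qty),
    sum_qtd = sum(qty), num_trans = n(),
    avg_qtd_by_trans = round(sum(qty) / n()),
    .groups = "drop"
  )
```

The same `kableExtra` styling used in the document can then be applied to `stats_demo` directly.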

Code
# Transforming the data to long format
op_data_long <- op_data %>%
  dplyr::select(ORDERED_CASES, LOADED_CASES, DELIVERED_CASES) %>%
  pivot_longer(cols = everything(), names_to = "case_type", values_to = "count") %>%
  mutate(case_type = factor(case_type, levels = c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES")))

# Define border colors based on case_type
border_colors <- c("ORDERED_CASES" = "grey", 
                   "LOADED_CASES" = "lightblue", 
                   "DELIVERED_CASES" = "darkblue")

# Plot with histograms
ggplot(op_data_long, aes(x = count)) +
  geom_histogram(binwidth = 1, 
                 aes(fill = case_type, color = case_type), 
                 alpha = 0.7) +
  facet_wrap(~case_type, scales = "free_x", nrow = 1, 
             labeller = as_labeller(c("ORDERED_CASES" = "Ordered", 
                                      "LOADED_CASES" = "Loaded", 
                                      "DELIVERED_CASES" = "Delivered"))) +
  scale_y_continuous(trans = 'log10', 
                     breaks = scales::trans_breaks("log10", function(x) 10^x), 
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_x_continuous(limits = c(0, 5000)) +  # Limit x-axis to 5000
  scale_color_manual(values = border_colors) +  
  theme_minimal() +
  labs(title = "Histograms of Case Counts", x = "Case Count", y = "Frequency (Log Scale)") +
  theme(
    strip.background = element_blank(),  
    strip.text = element_text(color = "black", size = 9),  
    panel.grid.major.x = element_blank(),  
    panel.grid.minor = element_blank(),  
    panel.grid.major.y = element_line(color = "grey", linewidth = 0.5),  
    axis.title = element_text(size = 7),
    axis.text = element_text(size = 6),
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),  
    strip.text.x = element_text(size = 8, hjust = 0.5),  
    legend.position = "none",  
    axis.text.y = element_text(size = 7),  
    axis.title.y = element_text(size = 8),  
    panel.spacing = unit(1, "lines")  
  )

Above are the histograms of case counts per transaction. We limited the visualization to 5,000 cases and applied a logarithmic y-scale for readability. The number of transactions thins out near 1,900 cases and picks up again around 2,000, possibly reflecting orders from the largest clients.
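The dip-and-rebound around 1,900-2,000 cases could be inspected directly by listing which customers transact in that band. A sketch on synthetic rows (the real check would filter `op_data` and its customer identifier):

```r
library(dplyr)

# Synthetic stand-in: a few large-case transactions
tx_demo <- tibble(
  CUSTOMER_NUMBER = c(101, 101, 202, 303, 404),
  DELIVERED_CASES = c(1850, 1950, 2050, 2100, 500)
)

# Which customers sit in the 1,800-2,100 case band where the dip appears?
band_demo <- tx_demo %>%
  filter(between(DELIVERED_CASES, 1800, 2100)) %>%
  count(CUSTOMER_NUMBER, name = "n_band_transactions") %>%
  arrange(desc(n_band_transactions))
```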

Below is the histogram of returned cases, where it is evident that the number of transactions is relatively low, with quantities generally not exceeding 250 cases. There are some transactions exceeding 1,000 cases, but they are rare. These were excluded to make the chart more interpretable.

Code
# Transforming the data to long format for RETURNED_CASES
op_data_long_returned <- op_data %>%
  dplyr::select(RETURNED_CASES) %>%
  pivot_longer(cols = everything(), names_to = "case_type", values_to = "count") %>%
  mutate(case_type = factor(case_type, levels = c("RETURNED_CASES")))

# Define border colors for RETURNED_CASES
border_colors_returned <- c("RETURNED_CASES" = "black")

# Plot with histogram for RETURNED_CASES
ggplot(op_data_long_returned, aes(x = count)) +
  geom_histogram(binwidth = 1, 
                 aes(fill = case_type, color = case_type), 
                 alpha = 0.7) +
  scale_x_continuous(limits = c(0, 1000)) +  # Set max limit for x-axis
  scale_y_continuous(trans = 'log10', 
                     breaks = scales::trans_breaks("log10", function(x) 10^x), 
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_returned) +  
  theme_minimal() +
  labs(title = "Returned Case Counts", x = "Case Count", y = "Frequency (Log Scale)") +
  theme(
    strip.background = element_blank(),  
    strip.text = element_text(color = "black", size = 9),  
    panel.grid.major.x = element_blank(),  
    panel.grid.minor = element_blank(),  
    panel.grid.major.y = element_line(color = "grey", linewidth = 0.5),  
    axis.title = element_text(size = 7),
    axis.text = element_text(size = 6),
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),  
    strip.text.x = element_text(size = 8, hjust = 0.5),  
    legend.position = "none",  
    axis.text.y = element_text(size = 7),  
    axis.title.y = element_text(size = 8),  
    panel.spacing = unit(1, "lines")  
  )

4.10 Transactions by Gallons

Code
# Create the summary table with adjusted minimum value and median considering values > 0 for GALLONS
summary_gallons <- data.frame(
  type = c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS", "RETURNED_GALLONS"),
  
  # Calculating the minimum value considering values > 0 and rounding to four decimal places
  min = c(
    round(min(op_data$ORDERED_GALLONS[op_data$ORDERED_GALLONS > 0]), 4),
    round(min(op_data$LOADED_GALLONS[op_data$LOADED_GALLONS > 0]), 4),
    round(min(op_data$DELIVERED_GALLONS[op_data$DELIVERED_GALLONS > 0]), 4),
    round(min(op_data$RETURNED_GALLONS[op_data$RETURNED_GALLONS > 0]), 4)
  ),
  
  # Median calculation considering only values greater than 0
  median = c(
    median(op_data$ORDERED_GALLONS[op_data$ORDERED_GALLONS > 0]),
    median(op_data$LOADED_GALLONS[op_data$LOADED_GALLONS > 0]),
    median(op_data$DELIVERED_GALLONS[op_data$DELIVERED_GALLONS > 0]),
    median(op_data$RETURNED_GALLONS[op_data$RETURNED_GALLONS > 0])
  ),
  
  # Maximum without decimal places
  max = c(
    floor(max(op_data$ORDERED_GALLONS)),
    floor(max(op_data$LOADED_GALLONS)),
    floor(max(op_data$DELIVERED_GALLONS)),
    floor(max(op_data$RETURNED_GALLONS))
  ),
  
  # Sum with no decimal places
  sum_qtd = c(
    format(floor(sum(op_data$ORDERED_GALLONS)), big.mark = ","),
    format(floor(sum(op_data$LOADED_GALLONS)), big.mark = ","),
    format(floor(sum(op_data$DELIVERED_GALLONS)), big.mark = ","),
    format(floor(sum(op_data$RETURNED_GALLONS)), big.mark = ",")
  ),
  
  # Number of transactions with thousands separator
  num_trans = c(
    format(sum(op_data$ORDERED_GALLONS > 0), big.mark = ","),
    format(sum(op_data$LOADED_GALLONS > 0), big.mark = ","),
    format(sum(op_data$DELIVERED_GALLONS > 0), big.mark = ","),
    format(sum(op_data$RETURNED_GALLONS > 0), big.mark = ",")
  ),
  
  # Average quantity per transaction without decimals
  avg_qtd_by_trans = c(
    round(sum(op_data$ORDERED_GALLONS) / max(1, sum(op_data$ORDERED_GALLONS > 0))),
    round(sum(op_data$LOADED_GALLONS) / max(1, sum(op_data$LOADED_GALLONS > 0))),
    round(sum(op_data$DELIVERED_GALLONS) / max(1, sum(op_data$DELIVERED_GALLONS > 0))),
    round(sum(op_data$RETURNED_GALLONS) / max(1, sum(op_data$RETURNED_GALLONS > 0)))
  )
)

# Create the table using kableExtra for better formatting
summary_gallons %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%  # Light red header
  add_header_above(c("GALLONS - Statistics by transactions greater than 0" = 7)) %>%
  kable_paper("striped", full_width = F)
GALLONS - Statistics by transactions greater than 0
type min median max sum_qtd num_trans avg_qtd_by_trans
ORDERED_GALLONS 0.0898 15.0 2562 10,323,336 482,518 21
LOADED_GALLONS 0.0898 15.0 2562 10,042,299 479,599 21
DELIVERED_GALLONS 0.0159 15.0 2292 9,660,192 464,231 21
RETURNED_GALLONS 0.0156 7.5 1792 32,513 1,760 18

The values for ORDERED GALLONS, LOADED GALLONS, and DELIVERED GALLONS are similar, as expected. Some records involve quantities of less than one unit, the maximum values exceed 2,200 gallons, and the average per transaction is approximately 21 gallons.

There are significantly fewer gallon transactions than case transactions, roughly 60% as many.

The number of RETURNED GALLONS transactions is far smaller (1,760), although the largest single return reached 1,792 gallons. The average return is 18 gallons per transaction.

Code
# Transforming the data to long format for gallons
op_data_long_gallons <- op_data %>%
  dplyr::select(ORDERED_GALLONS, LOADED_GALLONS, DELIVERED_GALLONS) %>%
  pivot_longer(cols = everything(), names_to = "gallon_type", values_to = "count") %>%
  mutate(gallon_type = factor(gallon_type, levels = c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS")))

# Define border colors based on gallon_type
border_colors_gallons <- c("ORDERED_GALLONS" = "grey", 
                           "LOADED_GALLONS" = "coral", 
                           "DELIVERED_GALLONS" = "darkred")

# Plot with histograms for gallons
ggplot(op_data_long_gallons, aes(x = count)) +
  geom_histogram(binwidth = 1, 
                 aes(fill = gallon_type, color = gallon_type), 
                 alpha = 0.7) +
  facet_wrap(~gallon_type, scales = "fixed", nrow = 1, 
             labeller = as_labeller(c("ORDERED_GALLONS" = "Ordered", 
                                      "LOADED_GALLONS" = "Loaded", 
                                      "DELIVERED_GALLONS" = "Delivered"))) +
  scale_y_continuous(trans = 'log10', 
                     breaks = scales::trans_breaks("log10", function(x) 10^x), 
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_gallons) +  
  scale_x_continuous(limits = c(0, 1000)) +  # Limit the x-axis to 1000
  theme_minimal() +
  labs(title = "Histograms of Gallon Counts", x = "Gallon Count", y = "Frequency (Log Scale)") +
  theme(
    strip.background = element_blank(),  
    strip.text = element_text(color = "black", size = 9),  
    panel.grid.major.x = element_blank(),  
    panel.grid.minor = element_blank(),  
    panel.grid.major.y = element_line(color = "grey", linewidth = 0.5),  
    axis.title = element_text(size = 7),
    axis.text = element_text(size = 6),
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),  
    strip.text.x = element_text(size = 8, hjust = 0.5),  
    legend.position = "none",  
    axis.text.y = element_text(size = 7),  
    axis.title.y = element_text(size = 8),  
    panel.spacing = unit(1, "lines")  
  )

We limited the histograms of gallon counts per transaction to 1000 for better visualization. There are only a few operations that exceed this limit. The vast majority of transactions do not exceed 500 gallons.
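The claim that the vast majority of transactions stay under 500 gallons can be quantified with quantiles of the positive deliveries. A sketch on synthetic values (the real check would use `op_data$DELIVERED_GALLONS`):

```r
# Synthetic gallon quantities, including zeros that are excluded as in the tables
gal_demo <- c(0, 0, 5, 10, 15, 15, 20, 40, 120, 480, 950)
pos <- gal_demo[gal_demo > 0]

# Median of positive deliveries and the share at or below 500 gallons
q50 <- unname(quantile(pos, 0.5))
share_under_500 <- mean(pos <= 500)
```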

Code
# Transforming the data to long format for RETURNED_GALLONS
op_data_long_returned_gallons <- op_data %>%
  dplyr::select(RETURNED_GALLONS) %>%
  pivot_longer(cols = everything(), names_to = "gallon_type", values_to = "count") %>%
  mutate(gallon_type = factor(gallon_type, levels = c("RETURNED_GALLONS")))

# Define border colors for RETURNED_GALLONS
border_colors_returned_gallons <- c("RETURNED_GALLONS" = "black")

# Plot with histogram for RETURNED_GALLONS
ggplot(op_data_long_returned_gallons, aes(x = count)) +
  geom_histogram(binwidth = 1, 
                 aes(fill = gallon_type, color = gallon_type), 
                 alpha = 0.7) +
  scale_y_continuous(trans = 'log10', 
                     breaks = scales::trans_breaks("log10", function(x) 10^x), 
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_returned_gallons) +  
  scale_x_continuous(limits = c(0, 500)) +  # Limit the x-axis to 500
  theme_minimal() +
  labs(title = "Returned Gallon Counts", x = "Gallon Count", y = "Frequency (Log Scale)") +
  theme(
    strip.background = element_blank(),  
    strip.text = element_text(color = "black", size = 9),  
    panel.grid.major.x = element_blank(),  
    panel.grid.minor = element_blank(),  
    panel.grid.major.y = element_line(color = "grey", linewidth = 0.5),  
    axis.title = element_text(size = 7),
    axis.text = element_text(size = 6),
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),  
    strip.text.x = element_text(size = 8, hjust = 0.5),  
    legend.position = "none",  
    axis.text.y = element_text(size = 7),  
    axis.title.y = element_text(size = 8),  
    panel.spacing = unit(1, "lines")  
  )

The number of returned gallon transactions is much lower compared to cases. Overall, these transactions do not exceed 100 gallons.

4.11 Transaction Dates Overview

Code
# Aggregate the transactions by month/year for gallons and cases delivered
op_data_monthly_delivery <- op_data %>%
  mutate(Month_Year = floor_date(TRANSACTION_DATE, "month")) %>%
  group_by(Month_Year) %>%
  summarise(Total_Delivered_Cases = sum(DELIVERED_CASES, na.rm = TRUE),
            Total_Delivered_Gallons = sum(DELIVERED_GALLONS, na.rm = TRUE))

# Reshape the data to long format for facet_wrap
op_data_long_delivery <- op_data_monthly_delivery %>%
  pivot_longer(cols = starts_with("Total_Delivered"), 
               names_to = "Event", 
               values_to = "Value")

# Create the plot with the same Y-axis scale for both events
ggplot(op_data_long_delivery, aes(x = Month_Year, y = Value, fill = Event)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Event, scales = "fixed", ncol = 1) +  # Use facet_wrap with a shared x-axis and same scale for both
  labs(title = "Monthly Delivered Cases and Gallons JAN 2023 - DEC 2024",
       x = "Month",
       y = "Total Units") +
  theme_minimal() +
  theme(legend.position = "none",  # Remove the legend
        axis.text.x = element_text(size = 9),  # Adjust the size of x-axis labels for better readability
        panel.grid.major.x = element_blank(),  # Remove vertical grid lines
        panel.grid.minor.x = element_blank()) +  # Remove minor vertical grid lines
  scale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) +  # Format x-axis to show only month abbreviations (JAN, FEB, etc.)
  scale_y_continuous(labels = scales::comma) +  # Display Y-axis in full units (e.g., 1000 instead of 1K)
  scale_fill_manual(values = c("Total_Delivered_Gallons" = "#FFCCCB", "Total_Delivered_Cases" = "#ADD8E6"))  # Set custom colors

The seasonal effect, related to lower temperatures (OCT-MAR), is more pronounced for the number of delivered cases than for gallons. Additionally, this chart highlights the significant difference in consumption between the two, as both quantities are represented on the same scale.
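The seasonal effect can be put in numbers by comparing average monthly volume in the cooler months (Oct-Mar) with the rest of the year. A sketch on synthetic monthly totals (the real comparison would start from the `op_data_monthly_delivery` aggregation above):

```r
library(dplyr)
library(lubridate)

# Synthetic monthly delivered-case totals for one year
monthly_demo <- tibble(
  Month_Year = seq(as.Date("2023-01-01"), as.Date("2023-12-01"), by = "month"),
  Total_Delivered_Cases = c(900, 950, 1000, 1200, 1400, 1500,
                            1550, 1500, 1300, 1100, 950, 900)
)

# Average monthly volume: cooler months (Oct-Mar) vs the rest
season_demo <- monthly_demo %>%
  mutate(season = if_else(month(Month_Year) %in% c(10:12, 1:3),
                          "Oct-Mar", "Apr-Sep")) %>%
  group_by(season) %>%
  summarise(avg_cases = mean(Total_Delivered_Cases), .groups = "drop")
```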

Code
# Aggregate the transactions by month/year for gallons and cases delivered
op_data_monthly_delivery <- op_data %>%
  mutate(Month_Year = floor_date(TRANSACTION_DATE, "month")) %>%
  group_by(Month_Year) %>%
  summarise(Total_Delivered_Cases = sum(DELIVERED_CASES, na.rm = TRUE),
            Total_Delivered_Gallons = sum(DELIVERED_GALLONS, na.rm = TRUE))

# Calculate the percentage of gallons in total (gallons + cases)
op_data_monthly_delivery <- op_data_monthly_delivery %>%
  mutate(Total_Sales = Total_Delivered_Cases + Total_Delivered_Gallons,
         Percentage_Gallons = (Total_Delivered_Gallons / Total_Sales) * 100)

# Create the plot with the percentage of gallons sold
ggplot(op_data_monthly_delivery, aes(x = Month_Year, y = Percentage_Gallons)) +
  geom_bar(stat = "identity", fill = "#FFCCCB") +  # Gallons color
  labs(title = "Percentage of Gallons Sold Relative to Total Sales (23 & 24)",
       x = "Month",
       y = "Percentage of Gallons (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 9, angle = 0, hjust = 1),  # Keep x-axis labels horizontal
        panel.grid.major.x = element_blank(),  # Remove vertical grid lines
        panel.grid.minor.x = element_blank()) +  # Remove minor vertical grid lines
  scale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) +  # Format x-axis to show month abbreviations
  scale_y_continuous(labels = scales::percent_format(scale = 1), 
                     breaks = seq(0, 100, by = 5))  # Set y-axis breaks to show percentages in 5% increments

The sale of gallons over the months remains between 20% and 25% of the total volume.

4.12 Retailer Consumption Quantities

Code
# Count distinct Retailers
cat("Number of Retailers:", n_distinct(full_data$PRIMARY_GROUP_NUMBER), "\n")
Number of Retailers: 1021 
Code
# Count distinct stores
cat("Number of Outlets/Stores:", n_distinct(full_data$CUSTOMER_NUMBER), "\n")
Number of Outlets/Stores: 30320 

Of the 30,320 stores, many belong to the same chains, with 1,020 networks represented in the dataset. (PRIMARY_GROUP_NUMBER = 0 represents the single stores.)
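The store-to-chain relationship can be summarized by counting distinct stores per group number, remembering that group 0 collects the single stores. A sketch on synthetic identifiers (the real counts come from `full_data`):

```r
library(dplyr)

# Synthetic store-to-group mapping; PRIMARY_GROUP_NUMBER == 0 marks single stores
stores_demo <- tibble(
  CUSTOMER_NUMBER      = 1:6,
  PRIMARY_GROUP_NUMBER = c(0, 0, 7, 7, 7, 9)
)

# Distinct stores per group, largest chains first
group_sizes <- stores_demo %>%
  distinct(PRIMARY_GROUP_NUMBER, CUSTOMER_NUMBER) %>%
  count(PRIMARY_GROUP_NUMBER, name = "n_stores") %>%
  arrange(desc(n_stores))

# Number of actual chains (group 0 excluded)
n_networks <- sum(group_sizes$PRIMARY_GROUP_NUMBER != 0)
```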

Code
# Creates the total deliveries by customer type (Single Store or Retailer Group)
total_delivered <- full_data %>%
  mutate(customer_type = ifelse(PRIMARY_GROUP_NUMBER == 0, "Single Store", "Retailer Group")) %>%
  group_by(customer_type) %>%
  summarise(
    qtd_cases_dlv_23 = sum(ifelse(YEAR == 2023, DELIVERED_CASES, 0), na.rm = TRUE),
    qtd_cases_dlv_24 = sum(ifelse(YEAR == 2024, DELIVERED_CASES, 0), na.rm = TRUE),
    total_qtd_cases_dlv = sum(DELIVERED_CASES, na.rm = TRUE),
    total_qtd_gallons_dlv = sum(DELIVERED_GALLONS, na.rm = TRUE)
  ) %>%
  ungroup()

# Calculates global totals for delivered cases and delivered gallons
total_cases <- sum(full_data$DELIVERED_CASES, na.rm = TRUE)
total_gallons <- sum(full_data$DELIVERED_GALLONS, na.rm = TRUE)

# Calculates the percentage for each group
total_delivered <- total_delivered %>%
  mutate(
    perc_total_qtd_cases = (total_qtd_cases_dlv / total_cases) * 100,
    perc_total_gallons = (total_qtd_gallons_dlv / total_gallons) * 100
  )

# Converts to data.table for efficient processing
setDT(total_delivered)

# Rounds percentages
total_delivered[, perc_total_qtd_cases := round(perc_total_qtd_cases, 0)]
total_delivered[, perc_total_gallons := round(perc_total_gallons, 0)]

# Adds a 'Total' row with global totals
total_delivered_total <- total_delivered %>%
  summarise(
    customer_type = "Total",
    qtd_cases_dlv_23 = sum(qtd_cases_dlv_23),
    qtd_cases_dlv_24 = sum(qtd_cases_dlv_24),
    total_qtd_cases_dlv = sum(total_qtd_cases_dlv),
    total_qtd_gallons_dlv = sum(total_qtd_gallons_dlv),
    perc_total_qtd_cases = 100,
    perc_total_gallons = 100
  ) %>%
  as.data.table()

# Combines the 'Total' row with the previous data
total_delivered <- rbind(total_delivered, total_delivered_total)

# Creates the cases table with the relevant columns
cases_table <- total_delivered[, .(
  customer_type,
  qtd_cases_dlv_23,
  qtd_cases_dlv_24,
  total_qtd_cases_dlv,
  perc_total_qtd_cases
)]

# Creates the gallons table with the same columns as the cases table.
# NOTE: total_delivered does not split gallons by year, so the two-year total
# is repeated in both year columns below; a proper fix would add per-year
# gallon sums (as done for cases) in the summarise step above.
gallons_table <- total_delivered[, .(
  customer_type,
  qtd_gallons_dlv_23 = total_qtd_gallons_dlv,  # placeholder: total, not 2023-only
  qtd_gallons_dlv_24 = total_qtd_gallons_dlv,  # placeholder: total, not 2024-only
  total_qtd_cases_dlv = total_qtd_gallons_dlv, # total gallons
  perc_total_qtd_cases = perc_total_gallons    # percentage for gallons
)]

# Creates the total table with the relevant columns.
# NOTE: because gallons are not split by year, the yearly columns add the full
# two-year gallon total to each year's cases, overstating them; only
# total_qtd_cas_gal and perc_total_qtd are strictly correct.
total_table <- total_delivered[, .(
  customer_type, 
  qtd_cas_gal_23 = qtd_cases_dlv_23 + total_qtd_gallons_dlv,
  qtd_cas_gal_24 = qtd_cases_dlv_24 + total_qtd_gallons_dlv,
  total_qtd_cas_gal = total_qtd_cases_dlv + total_qtd_gallons_dlv,
  perc_total_qtd = ((total_qtd_cases_dlv + total_qtd_gallons_dlv) / (total_cases + total_gallons)) * 100
)]

# Rounds the percentage for the total table
total_table[, perc_total_qtd := round(perc_total_qtd, 0)]

# Format the numeric columns with a thousand separator for all tables
format_cols_cases <- c("qtd_cases_dlv_23", "qtd_cases_dlv_24", "total_qtd_cases_dlv", "perc_total_qtd_cases")
format_cols_gallons <- c("qtd_gallons_dlv_23", "qtd_gallons_dlv_24", "total_qtd_cases_dlv", "perc_total_qtd_cases")
format_cols_total <- c("qtd_cas_gal_23", "qtd_cas_gal_24", "total_qtd_cas_gal", "perc_total_qtd")

# Format the columns after the tables are created
cases_table[, (format_cols_cases) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_cases]
gallons_table[, (format_cols_gallons) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_gallons]
total_table[, (format_cols_total) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_total]

# Displays cases
cases_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%  # Light blue header
  add_header_above(c("CASES - Statistics by deliveries greater than 0" = 5)) %>%
  kable_paper("striped", full_width = F)
CASES - Statistics by deliveries greater than 0
customer_type qtd_cases_dlv_23 qtd_cases_dlv_24 total_qtd_cases_dlv perc_total_qtd_cases
Retailer Group 10,099,875 10,770,367 20,870,242 79
Single Store 2,684,696 2,879,141 5,563,837 21
Total 12,784,571 13,649,508 26,434,079 100

Considering cases, about 79% of the volume went to stores that belong to larger retail groups.

Code
# Displays gallons
gallons_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%  # Light red header
  add_header_above(c("GALLONS - Statistics by deliveries greater than 0" = 5)) %>%
  kable_paper("striped", full_width = F)
GALLONS - Statistics by deliveries greater than 0
customer_type qtd_gallons_dlv_23 qtd_gallons_dlv_24 total_qtd_cases_dlv perc_total_qtd_cases
Retailer Group 4,565,535 4,565,535 4,565,535 47
Single Store 5,094,657 5,094,657 5,094,657 53
Total 9,660,192 9,660,192 9,660,192 100

For gallons, the picture is quite different: 53% went to single stores and 47% to retailer groups, indicating that independent stores account for a much larger share of gallon consumption than of case consumption.

Code
# Displays total (cases + gallons)
total_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%  # Light gray header
  add_header_above(c("TOTAL - Combined Deliveries Quantities for Cases and Gallons" = 5)) %>%
  kable_paper("striped", full_width = F)
TOTAL - Combined Deliveries Quantities for Cases and Gallons
customer_type qtd_cas_gal_23 qtd_cas_gal_24 total_qtd_cas_gal perc_total_qtd
Retailer Group 14,665,410 15,335,902 25,435,777 70
Single Store 7,779,354 7,973,798 10,658,494 30
Total 22,444,764 23,309,700 36,094,271 100

The table below helps to better explore the data presented above.

Code
# Summarize delivered cases and gallons for 2023 and 2024
summary_2023 <- full_data %>%
  filter(YEAR == 2023) %>%
  group_by(PRIMARY_GROUP_NUMBER) %>%
  summarise(
    cas_qtd_dlv23 = sum(DELIVERED_CASES, na.rm = TRUE),
    gal_qtd_dlv23 = sum(DELIVERED_GALLONS, na.rm = TRUE)
  )

summary_2024 <- full_data %>%
  filter(YEAR == 2024) %>%
  group_by(PRIMARY_GROUP_NUMBER) %>%
  summarise(
    cas_qtd_dlv24 = sum(DELIVERED_CASES, na.rm = TRUE),
    gal_qtd_dlv24 = sum(DELIVERED_GALLONS, na.rm = TRUE)
  )

# Merge summaries and compute total values
group_demand <- full_join(summary_2023, summary_2024, by = "PRIMARY_GROUP_NUMBER") %>%
  mutate(
    across(c(cas_qtd_dlv23, gal_qtd_dlv23, cas_qtd_dlv24, gal_qtd_dlv24), ~replace_na(., 0)),
    total_23 = cas_qtd_dlv23 + gal_qtd_dlv23,
    total_24 = cas_qtd_dlv24 + gal_qtd_dlv24,
    sum_23_24 = total_23 + total_24
  ) %>%
  rename(PGN = PRIMARY_GROUP_NUMBER) %>%
  arrange(desc(sum_23_24))
# Optionally exclude rows where PRIMARY_GROUP_NUMBER is 0:
# %>% filter(PGN != 0)

# Convert to data.table for performance
setDT(group_demand)

# Display interactive table with formatted numbers (without changing type)
datatable(
  group_demand, 
  options = list(pageLength = 10, autoWidth = TRUE),
  rownames = FALSE,
  caption = "Quantity Delivered"
) %>%
  formatCurrency(
    columns = c("cas_qtd_dlv23", "gal_qtd_dlv23", "cas_qtd_dlv24", "gal_qtd_dlv24", "total_23", "total_24", "sum_23_24"),
    currency = "",  # No currency symbol
    digits = 0,  # No decimal places
    mark = ","  # Thousands separator
  )
Code
# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and the new variables from removal
vars_to_keep <- c("full_data", "full_data_customer", "cost_data", "customer_address", 
                  "mydir", "one_seed", "op_data", "profile_data", "reference_date","custom_palette")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)

5. Feature Engineering

Building on the previous analyses, the goal now is to engineer additional features that can enhance the robustness of the modeling process. Several feature engineering techniques were attempted, but only the most relevant ones are described here.

5.1 Census Data

The location information was enriched with data from the U.S. Census Bureau's American Community Survey (ACS), which is revised annually based on the most recent data. The 2023 ACS release, which is benchmarked to the 2020 Census, was retrieved; data for 2024 was not yet available at the time of retrieval.

The decision to use coordinates for store locations, even when there are multiple instances of identical coordinates across different ZIP codes, was made due to the challenges encountered when retrieving Census data based on ZIP codes. Different stores or customers within the same ZIP code can share coordinates, particularly in areas like shopping centers with multiple businesses.

Below are the descriptions of the imported data:

Code
#Creating the data for the table
 census_data <- tibble(
   variable = c(
     "MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP", 
     "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5", 
     "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10", 
     "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15", 
     "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT", "BACH_DEG", 
     "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP", "TOT_WORK_POP", 
     "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP", "FEMALE_POP", 
     "COMMUTE_POP", "COMMUTE_POP_DRIVE"
   ),
   description = c(
     "Median household income", "Gini index of income inequality", 
     "Per capita income", "Median home value", "Population below poverty", 
     "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999", 
     "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999", 
     "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999", 
     "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999", 
     "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", 
     "$200,000 or more", "Total housing units", "Vacant housing units", 
     "Median gross rent", "Bachelor's degree holders", "Master's degree holders", 
     "Doctoral degree holders", "Unemployed population", "Employed population", 
     "Total working population", "Food stamp households", "Median family income", 
     "Total population", "Male population", "Female population", 
     "Total commuter population", "Total commuter population driving"
   )
 )

 #Table
datatable(census_data, 
          options = list(scrollX = TRUE, pageLength = 10), 
          caption = "List of Census Variables and Descriptions")
Code
library(tidycensus)
library(sf) 

# Census Bureau API key
#census_api_key(" ", install = TRUE)

# Create a copy of full_data_customer with only the relevant columns
data_sf <- full_data_customer %>%
  dplyr::select(CUSTOMER_NUMBER, LONGITUDE, LATITUDE)

# Convert customer data to sf object
data_sf <- data_sf %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

# Ensure the 'census_variables' object is defined
census_variables <- tibble(
  code = c(
    "B19013_001", "B19083_001", "B19301_001", "B25077_001", "B17001_002", 
    "B19001_002", "B19001_003", "B19001_004", "B19001_005", "B19001_006", 
    "B19001_007", "B19001_008", "B19001_009", "B19001_010", "B19001_011", 
    "B19001_012", "B19001_013", "B19001_014", "B19001_015", "B19001_016", 
    "B19001_017", "B25001_001", "B25002_003", "B25064_001", "B15003_017", 
    "B15003_022", "B15003_025", "B23025_005", "B23025_004", "B24011_001", 
    "B22001_002", "B19058_001", "B01003_001", "B01001_002", "B01001_026", 
    "B08006_001", "B08006_002"
  ),
  description = c(
    "MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP", 
    "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5", 
    "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10", 
    "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15", 
    "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT", "BACH_DEG", 
    "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP", "TOT_WORK_POP", 
    "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP", "FEMALE_POP", 
    "COMMUTE_POP", "COMMUTE_POP_DRIVE"
  ),
  full_description = c(
    "Median household income", "Gini index of income inequality", 
    "Per capita income", "Median home value", "Population below poverty", 
    "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999", 
    "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999", 
    "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999", 
    "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999", 
    "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", 
    "$200,000 or more", "Total housing units", "Vacant housing units", 
    "Median gross rent", "Bachelor's degree holders", "Master's degree holders", 
    "Doctoral degree holders", "Unemployed population", "Employed population", 
    "Total working population", "Food stamp households", "Median family income", 
    "Total population", "Male population", "Female population", 
    "Total commuter population", "Total commuter population driving"
  )
)

# Retrieve ACS data
acs_data <- get_acs(
  geography = "tract",
  variables = census_variables$code,
  year = 2023,
  state = unique(full_data_customer$STATE),
  geometry = TRUE
)

# Merge with descriptions
acs_data <- acs_data %>%
  left_join(census_variables, by = c("variable" = "code"))

# Transform CRS to match customer data
data_sf <- st_transform(data_sf, st_crs(acs_data))

# Perform spatial join
joined_data_sf <- st_join(data_sf, acs_data, join = st_intersects)

# Reshape the dataset, keeping only the 'estimate' values
census <- joined_data_sf %>%
  mutate(
    variable_name = if_else(variable %in% census_variables$code, description, variable)
  ) %>%
  pivot_wider(
    names_from = variable_name,
    values_from = estimate,
    names_glue = "{variable_name}"
  )

# Select only the required columns
census <- census %>%
  dplyr::select(
    CUSTOMER_NUMBER, MED_HH_INC, GINI_IDX, PER_CAP_INC, MED_HOME_VAL, POV_POP, 
    INC_LVL_1, INC_LVL_2, INC_LVL_3, INC_LVL_4, INC_LVL_5, INC_LVL_6, 
    INC_LVL_7, INC_LVL_8, INC_LVL_9, INC_LVL_10, INC_LVL_11, INC_LVL_12, 
    INC_LVL_13, INC_LVL_14, INC_LVL_15, INC_LVL_16, TOT_HOUS_UNITS, 
    VAC_HOUS_UNITS, MED_GROSS_RENT, BACH_DEG, MAST_DEG, DOC_DEG, UNEMP_POP, 
    EMP_POP, TOT_WORK_POP, SNAP_HH, MED_FAM_INC, TOT_POP, MALE_POP, FEMALE_POP, 
    COMMUTE_POP, COMMUTE_POP_DRIVE
  )

# Remove the geometry column and convert to a normal data frame
census <- census %>%
  st_drop_geometry() %>% 
  as.data.frame()

# Handle missing and infinite values (replace -Inf with NA)
census[census == -Inf] <- NA

# Optionally impute missing values or remove them
census[is.na(census)] <- 0  # You could also choose to impute using other strategies

# Aggregate census data by CUSTOMER_NUMBER, keeping the highest value for each column
census <- census %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(across(everything(), ~ max(., na.rm = TRUE)), .groups = "drop")

# Perform the join between full_data_customer and census on the CUSTOMER_NUMBER column
full_data_customer <- full_data_customer %>%
  dplyr::left_join(census, by = "CUSTOMER_NUMBER")

# Remove any duplicated columns or columns with ".x" suffixes
full_data_customer <- full_data_customer %>%
  dplyr::select(-ends_with(".x")) %>%
  dplyr::rename_with(~gsub("\\.y$", "", .), ends_with(".y"))

# Transforming variable types before save
full_data_customer$COLD_DRINK_CHANNEL <- as.factor(full_data_customer$COLD_DRINK_CHANNEL)
full_data_customer$TRADE_CHANNEL <- as.factor(full_data_customer$TRADE_CHANNEL)
full_data_customer$SUB_TRADE_CHANNEL <- as.factor(full_data_customer$SUB_TRADE_CHANNEL)

During the modeling process, it became clear that the absence of 2024 data limited the analysis. In addition, correlations between the census variables and, in particular, customer demand volumes were very low. Because of this, these variables were not explored further in the document. The goal is for this initial process to serve as a foundation for future analyses.

5.2 RFM Score

The RFM (Recency, Frequency, Monetary) analysis segments customers based on purchasing behavior, providing insights into consumption patterns. Adapting this model to analyze customer orders helps assess both the frequency and volume of purchases.
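Before adapting the model, the classic RFM idea can be sketched on toy data. The example below uses hypothetical customers and dates (not from the client dataset) and scores each dimension into terciles with dplyr's `ntile()`; the adapted version developed in the following subsections replaces these generic quantile scores with thresholds tailored to the data.

```r
library(dplyr)

# Toy transaction log (hypothetical customers, dates, and quantities)
orders <- tibble(
  customer   = c("A", "A", "A", "B", "B", "C"),
  order_date = as.Date(c("2024-11-01", "2024-11-20", "2024-12-15",
                         "2024-03-10", "2024-06-01", "2023-02-05")),
  quantity   = c(40, 55, 60, 10, 12, 5)
)

reference_date <- as.Date("2025-01-01")

# Classic RFM: score each dimension into terciles (1 = low, 3 = high)
rfm <- orders %>%
  group_by(customer) %>%
  summarise(
    recency   = as.numeric(reference_date - max(order_date)),  # days since last order
    frequency = n(),                                           # number of orders
    monetary  = sum(quantity),                                 # here: total quantity
    .groups   = "drop"
  ) %>%
  mutate(
    r_score   = ntile(-recency, 3),  # more recent orders earn a higher score
    f_score   = ntile(frequency, 3),
    m_score   = ntile(monetary, 3),
    rfm_score = r_score + f_score + m_score
  )
# Customer A scores 9, B scores 6, C scores 3
```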

5.2.1 Frequency - Days Between Orders

To adapt the RFM framework to purchase periods and ordered quantities, the analysis focuses on customer orders. Before calculating the number of days between orders (frequency), the total number of orders per customer is determined, considering only transactions with a quantity of gallons or cases greater than 0.

Code
# Filter valid transactions (ORDERED_CASES > 0 or ORDERED_GALLONS > 0)
valid_orders <- full_data %>%
  filter(ORDERED_CASES > 0 | ORDERED_GALLONS > 0)

# Calculate the number of orders > 0 per customer
orders_per_customer <- valid_orders %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_ORDERS = n(), .groups = "drop") %>%
  ungroup()

# Add the column NUM_ORDERS in full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(orders_per_customer, by = "CUSTOMER_NUMBER")

# Find customers who do not meet the condition (NO valid transactions)
customers_not_meeting_filter <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))

# Print the number of unique customers who don't meet the filter
#print(customers_not_meeting_filter)

# Remove unnecessary intermediate data frames
rm(valid_orders, orders_per_customer,customers_not_meeting_filter)

There are 135 customers who do not have order transactions greater than zero in the dataset; for these customers, I will consider the number of delivery transactions as orders.

Code
# Filter customers with NUM_ORDERS == NA
customers_with_na_orders <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  dplyr::select(CUSTOMER_NUMBER) %>%
  distinct()

# Filter valid delivery transactions (DELIVERED_CASES > 0 or DELIVERED_GALLONS > 0) in full_data
valid_deliveries <- full_data %>%
  filter(DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0)

# Calculate the number of valid deliveries per customer with NUM_ORDERS == NA
deliveries_per_customer <- valid_deliveries %>%
  filter(CUSTOMER_NUMBER %in% customers_with_na_orders$CUSTOMER_NUMBER) %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_DELIVERIES = n()) %>%
  ungroup()

# Update NUM_ORDERS only for customers with NUM_ORDERS == NA
full_data_customer <- full_data_customer %>%
  left_join(deliveries_per_customer, by = "CUSTOMER_NUMBER") %>%
  mutate(
    NUM_ORDERS = if_else(is.na(NUM_ORDERS), NUM_DELIVERIES, NUM_ORDERS)
  ) %>%
  dplyr::select(-NUM_DELIVERIES)  # Drop the temporary NUM_DELIVERIES column

# Ensure full_data has the NUM_ORDERS column with the same values as full_data_customer
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NUM_ORDERS), by = "CUSTOMER_NUMBER")

# Remove unnecessary intermediate data frames
rm(customers_with_na_orders, valid_deliveries, deliveries_per_customer)

Considering all the order transactions recorded in 2023 and 2024, each unique customer has a minimum of 1 transaction and a maximum of 392 transactions.

To better understand the consumption profile of each customer, below we will visualize the number of customers in transaction bins where the orders of cases or gallons were greater than 0. For the 135 unique customers who did not have order transactions but received volume, we considered these operations as orders.

Code
# Count the number of valid transactions per customer
customers_by_bin <- full_data_customer %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(transaction_count = sum(NUM_ORDERS, na.rm = TRUE), .groups = "drop") %>%
  mutate(transaction_bin = case_when(
    transaction_count == 1 ~ "1",
    transaction_count >= 2 & transaction_count <= 10 ~ "2-10",
    transaction_count >= 11 & transaction_count <= 20 ~ "11-20",
    transaction_count >= 21 & transaction_count <= 30 ~ "21-30",
    transaction_count >= 31 & transaction_count <= 40 ~ "31-40",
    transaction_count >= 41 & transaction_count <= 50 ~ "41-50",
    transaction_count >= 51 & transaction_count <= 100 ~ "51-100",
    transaction_count >= 101 & transaction_count <= 200 ~ "101-200",
    transaction_count >= 201 & transaction_count <= 300 ~ "201-300",
    transaction_count > 300 ~ ">300",
    TRUE ~ "Other"
  )) %>%
  mutate(transaction_bin = factor(transaction_bin, levels = c("1", "2-10", "11-20", "21-30", "31-40", 
                                                             "41-50", "51-100", "101-200", "201-300", ">300"))) %>%
  group_by(transaction_bin) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  arrange(transaction_bin)

# Create a bar plot resembling a histogram of unique customers per transaction bin
ggplot(customers_by_bin, aes(x = transaction_bin, y = unique_customers, fill = transaction_bin)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = unique_customers), vjust = -0.3, size = 3, color = "black") +  # Add customer counts above bars
  scale_fill_brewer(palette = "Set3") +  # Use RColorBrewer's Set3 palette
  labs(title = "Number of Unique Customers by Transaction Count (Orders > 0)",
       x = "Transaction Count Bins",
       y = "Number of Unique Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5, vjust = 0.5),  # Centered x-axis labels without rotation
    panel.grid.major.x = element_blank(),  # Remove vertical grid lines
    panel.grid.minor.x = element_blank(),  # Remove minor vertical grid lines
    axis.text = element_text(size = 9),  # Set the size of axis labels
    axis.title = element_text(size = 10)  # Set the size of axis titles
  )

Code
# Remove unnecessary intermediate data frames
rm(customers_by_bin)

The histogram shows that 1,218 customers have only one order transaction, making it impossible to calculate the days between orders. Additionally, 6,798 customers have between 2 and 10 orders. To ensure more reliable figures, we will consider only customers with at least 11 orders for this indicator. As a result, all customers with fewer transactions will be assigned a value of 731 days between orders, indicating low order frequency over a two-year range.

Code
# Calculate the number of days between orders for customers with NUM_ORDERS >= 11
full_data <- full_data %>%
  arrange(CUSTOMER_NUMBER, TRANSACTION_DATE) %>%  # Sort by CUSTOMER_NUMBER and TRANSACTION_DATE
  group_by(CUSTOMER_NUMBER) %>%
  mutate(DAYS_BETWEEN_ORD = case_when(
    NUM_ORDERS <= 10 ~ 731,  # Set DAYS_BETWEEN_ORD to 731 for customers with NUM_ORDERS <= 10
    NUM_ORDERS >= 11 & 
      (ORDERED_CASES > 0 | ORDERED_GALLONS > 0) ~ 
      as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units = "days")),  # Calculate days between orders for NUM_ORDERS >= 11 where ORDERED_CASES or ORDERED_GALLONS > 0
    NUM_ORDERS >= 11 & 
      !(ORDERED_CASES > 0 | ORDERED_GALLONS > 0) &  # Only apply this when the previous condition fails
      (DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0) ~ 
      as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units = "days")),  # If no ORDERED_CASES or ORDERED_GALLONS > 0, calculate with DELIVERED_CASES or DELIVERED_GALLONS
    TRUE ~ NA_real_  # For all other cases
  )) %>%
  ungroup()


# Calculate the average days between orders per customer and round the result to the nearest integer
avg_days_per_customer <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(AVG_DAYS_BET_ORD = round(mean(DAYS_BETWEEN_ORD, na.rm = TRUE), 0)) %>%  # Round to nearest integer
  ungroup()

# Update full_data_customer with the average days between orders
full_data_customer <- full_data_customer %>%
  left_join(avg_days_per_customer, by = "CUSTOMER_NUMBER")

# Remove temporary variables
rm(avg_days_per_customer)
Code
# Count the number of unique customers in each days between orders bin without adding a new column to the dataset
customers_by_bin <- full_data_customer %>%
  mutate(DAYS_BETWEEN_ORD_BIN = case_when(
    AVG_DAYS_BET_ORD >= 1 & AVG_DAYS_BET_ORD <= 10 ~ "1-10 days",
    AVG_DAYS_BET_ORD > 10 & AVG_DAYS_BET_ORD <= 20 ~ "11-20 days",
    AVG_DAYS_BET_ORD > 20 & AVG_DAYS_BET_ORD <= 30 ~ "21-30 days",
    AVG_DAYS_BET_ORD > 30 & AVG_DAYS_BET_ORD <= 40 ~ "31-40 days",
    AVG_DAYS_BET_ORD > 40 & AVG_DAYS_BET_ORD <= 50 ~ "41-50 days",
    AVG_DAYS_BET_ORD > 50 ~ "Above 50 days",
    TRUE ~ "One Order Only"
  )) %>%
  group_by(DAYS_BETWEEN_ORD_BIN) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  mutate(percentage_customers = unique_customers / sum(unique_customers) * 100) %>%  # Calculate percentage
  arrange(DAYS_BETWEEN_ORD_BIN)

# Create a bar plot resembling a histogram of unique customers percentage per days between orders bin
ggplot(customers_by_bin, aes(x = DAYS_BETWEEN_ORD_BIN, y = percentage_customers, fill = DAYS_BETWEEN_ORD_BIN)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = scales::percent(percentage_customers / 100)), vjust = -0.3, size = 3) +  # Add percentage labels above bars
  scale_fill_brewer(palette = "Set3") +  # Use RColorBrewer's Set3 palette
  labs(title = "Percentage of Unique Customers by Days Between Orders",
       x = "Days Between Orders",
       y = "Percentage of Unique Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5, vjust = 0.5),  # Centered x-axis labels without rotation
    panel.grid.major.x = element_blank(),  # Remove vertical grid lines
    panel.grid.minor.x = element_blank(),  # Remove minor vertical grid lines
    axis.text = element_text(size = 9),  # Set the size of axis labels
    axis.title = element_text(size = 10)  # Set the size of axis titles
  )

Code
# Remove unnecessary intermediate data frames
rm(customers_by_bin)

Around 20% of customers had an average order interval of up to 10 days, while 44% showed an average interval of more than 30 days. Approximately 5% of customers placed only one order, making it impossible to calculate the number of days between orders.

5.2.2 Recency - Time Since Last Order

To calculate recency, I will consider the number of days between the date of each customer's last order and the reference date of January 1, 2025.
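As a quick illustration with a made-up last-order date, recency is simply the day difference between the last order and the reference date:

```r
# Recency for one hypothetical customer: days from the last order
# to the reference date of January 1, 2025
reference_date <- as.Date("2025-01-01")
last_order     <- as.Date("2024-11-20")

recency_days <- as.numeric(difftime(reference_date, last_order, units = "days"))
recency_days
# 42
```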

Code
# Create the LAST_ORDER_DATE column, excluding rows where all specified columns are zero
full_data <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  mutate(
    LAST_ORDER_DATE = if_else(
      (ORDERED_CASES > 0 | ORDERED_GALLONS > 0) & 
      !(ORDERED_CASES == 0 & ORDERED_GALLONS == 0 & LOADED_CASES == 0 & LOADED_GALLONS == 0 & DELIVERED_CASES == 0 & DELIVERED_GALLONS == 0),
      as.character(max(TRANSACTION_DATE, na.rm = TRUE)), 
      NA_character_
    )
  ) %>%
  ungroup()

There are 5,754 transaction rows where assigning the last transaction date based on orders is not possible. For these, the date of the last delivery operation will be used as the reference date. The last two transactions, referring to return transactions, will be excluded.

Code
# For customers with LAST_ORDER_DATE as NA, consider the latest TRANSACTION_DATE where DELIVERED_CASES or DELIVERED_GALLONS > 0
full_data <- full_data %>%
  mutate(LAST_ORDER_DATE = as.Date(LAST_ORDER_DATE)) %>%  # Convert LAST_ORDER_DATE to Date format
  group_by(CUSTOMER_NUMBER) %>%
  mutate(
    LAST_ORDER_DATE = if_else(
      is.na(LAST_ORDER_DATE) & (ORDERED_CASES == 0 & ORDERED_GALLONS == 0),
      as.Date(max(TRANSACTION_DATE[DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0], na.rm = TRUE)),
      LAST_ORDER_DATE
    )
  ) %>%
  ungroup()


# Remove the last 2 rows where LAST_ORDER_DATE is NA (return operations only)
full_data <- full_data %>%
  filter(!is.na(LAST_ORDER_DATE))

# Remove rows where LAST_ORDER_DATE is Inf (return operations only)
full_data <- full_data %>%
  filter(!is.infinite(LAST_ORDER_DATE))

# Reference Date
reference_date <- as.Date("2025-01-01")

# Create the DAYS_AF_LAST_ORD column in full_data
full_data <- full_data %>%
  mutate(
    DAYS_AF_LAST_ORD = ifelse(!is.na(LAST_ORDER_DATE), 
                              as.numeric(difftime(reference_date, LAST_ORDER_DATE, units = "days")),
                              NA_real_))

# Aggregate full_data to get the latest LAST_ORDER_DATE and DAYS_AF_LAST_ORD for each CUSTOMER_NUMBER
full_data_aggregated <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(
    LAST_ORDER_DATE = max(LAST_ORDER_DATE, na.rm = TRUE),
    DAYS_AF_LAST_ORD = max(DAYS_AF_LAST_ORD, na.rm = TRUE),
    .groups = 'drop'
  )

# Join the aggregated data with full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(full_data_aggregated, by = "CUSTOMER_NUMBER")


# Remove unnecessary intermediate data frames
rm(full_data_aggregated)

5.2.3 Total Quantity Ordered

As there is no access to the prices charged, and considering that they likely vary among customer types and demanded volumes, the focus will be on the quantities demanded instead of monetary values. This approach aligns with the current objective of customer segmentation.

Code
# Calculate the total ordered by customer by summing ORDERED_CASES and ORDERED_GALLONS
total_ordered_per_customer <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TOTAL_ORDERED = sum(ORDERED_CASES, na.rm = TRUE) + sum(ORDERED_GALLONS, na.rm = TRUE)) %>%
  ungroup()

# Add the TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBER
full_data_customer <- full_data_customer %>%
  left_join(total_ordered_per_customer, by = "CUSTOMER_NUMBER")

# Identify customers with TOTAL_ORDERED == 0
customers_with_zero_ordered <- total_ordered_per_customer %>%
  filter(TOTAL_ORDERED == 0)

# For those customers, calculate DELIVERED_CASES + DELIVERED_GALLONS from full_data
deliveries_for_zero_orders <- full_data %>%
  filter(CUSTOMER_NUMBER %in% customers_with_zero_ordered$CUSTOMER_NUMBER) %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(DELIVERED_TOTAL = sum(DELIVERED_CASES, na.rm = TRUE) + sum(DELIVERED_GALLONS, na.rm = TRUE)) %>%
  ungroup()

# Merge the delivery values into the total_ordered_per_customer dataframe,
# ensuring that if TOTAL_ORDERED is zero, it is replaced by DELIVERED_TOTAL
total_ordered_per_customer <- total_ordered_per_customer %>%
  left_join(deliveries_for_zero_orders, by = "CUSTOMER_NUMBER") %>%
  mutate(
    TOTAL_ORDERED = if_else(TOTAL_ORDERED == 0, DELIVERED_TOTAL, TOTAL_ORDERED)
  ) %>%
  dplyr::select(CUSTOMER_NUMBER, TOTAL_ORDERED)

# Add the updated TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBER
full_data_customer <- full_data_customer %>%
  left_join(total_ordered_per_customer, by = "CUSTOMER_NUMBER")

# Remove the 'TOTAL_ORDERED.x' column and rename 'TOTAL_ORDERED.y' to 'TOTAL_ORDERED'
full_data_customer <- full_data_customer %>%
  dplyr::select(-TOTAL_ORDERED.x) %>%
  dplyr::rename(TOTAL_ORDERED = TOTAL_ORDERED.y)

# Remove unnecessary intermediate data frames
rm(total_ordered_per_customer, customers_with_zero_ordered, deliveries_for_zero_orders)

5.2.4 Adapted RFM Score

Scores were assigned to classes based on the distribution of the created variables. The total score, combined with its relative weight, formed the RFM_SCORE, which served as an additional variable for customer analysis and segmentation.

To define these scores, the quantitative distribution of each variable was used, especially considering the wide range observed in some of them. Each variable received a score from 1 to 10. In the case of frequency, two separate variables were created, and weight was given not only to the number of orders but also to the interval between them. As a result, the total score ranged from 4 to 40.
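To illustrate how the four components combine, the sketch below scores two hypothetical customers (the component values are invented for illustration, not taken from the dataset). Since each of the four components ranges from 1 to 10, the total necessarily spans 4 to 40.

```r
library(dplyr)

# Hypothetical customers with the four component scores already assigned (1-10 each)
scores <- tibble(
  customer              = c("low_activity", "high_activity"),
  RECENCY_SCORE         = c(2, 9),
  ORDER_FREQUENCY_SCORE = c(1, 8),
  ORDER_INTERVAL_SCORE  = c(1, 9),
  VOLUME_SCORE          = c(3, 10)
)

# Sum the four components into the combined score
scores <- scores %>%
  mutate(RFM_SCORE = RECENCY_SCORE + ORDER_FREQUENCY_SCORE +
                     ORDER_INTERVAL_SCORE + VOLUME_SCORE)
# low_activity -> 7, high_activity -> 36; possible totals span 4 to 40
```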

Code
# Remove previously created columns
#full_data_customer <- full_data_customer %>%
#  dplyr::select(-FREQUENCY_SCORE, -RECENCY_SCORE, -VOLUME_SCORE, -RFM_SCORE)

# Create Frequency Score based on NUM_ORDERS
full_data_customer <- full_data_customer %>%
  mutate(
    ORDER_FREQUENCY_SCORE = case_when(
      NUM_ORDERS >= 300 ~ 10,  
      NUM_ORDERS >= 200 ~ 9,
      NUM_ORDERS >= 150 ~ 8,
      NUM_ORDERS >= 100 ~ 7,
      NUM_ORDERS >= 75  ~ 6,
      NUM_ORDERS >= 50  ~ 5,  # 3rd quartile
      NUM_ORDERS >= 35  ~ 4,  # Mean
      NUM_ORDERS >= 23  ~ 3,  # Median
      NUM_ORDERS >= 10  ~ 2,  # 1st quartile
      TRUE ~ 1  
    ),
    ORDER_INTERVAL_SCORE = case_when(
      AVG_DAYS_BET_ORD <= 5   ~ 10,
      AVG_DAYS_BET_ORD <= 13  ~ 9, # 1st quartile
      AVG_DAYS_BET_ORD <= 20  ~ 8,
      AVG_DAYS_BET_ORD <= 26  ~ 7, # Median
      AVG_DAYS_BET_ORD <= 30  ~ 6, 
      AVG_DAYS_BET_ORD <= 50  ~ 5,  
      AVG_DAYS_BET_ORD <= 100 ~ 4,
      AVG_DAYS_BET_ORD <= 210 ~ 3, # Mean
      AVG_DAYS_BET_ORD <= 300 ~ 2,
      TRUE ~ 1  
    )
  )

# Create Recency Score based on DAYS_AF_LAST_ORD
full_data_customer <- full_data_customer %>%
  mutate(
    RECENCY_SCORE = case_when(
      DAYS_AF_LAST_ORD <= 7   ~ 10,  
      DAYS_AF_LAST_ORD <= 13  ~ 9,  # 1st quartile
      DAYS_AF_LAST_ORD <= 20  ~ 8,  
      DAYS_AF_LAST_ORD <= 27  ~ 7,  #Median
      DAYS_AF_LAST_ORD <= 40  ~ 6,  
      DAYS_AF_LAST_ORD <= 50  ~ 5,
      DAYS_AF_LAST_ORD <= 72  ~ 4,  #Mean
      DAYS_AF_LAST_ORD <= 90  ~ 3,  #3rd quartile
      DAYS_AF_LAST_ORD <= 180 ~ 2,  #Six months
      TRUE ~ 1  
    )
  )

# Create Volume Score based on TOTAL_ORDERED
full_data_customer <- full_data_customer %>%
  mutate(
    VOLUME_SCORE = case_when(
      TOTAL_ORDERED >= 300000 ~ 10,  
      TOTAL_ORDERED >= 100000 ~ 9,
      TOTAL_ORDERED >= 5000   ~ 8,
      TOTAL_ORDERED >= 2000   ~ 7,
      TOTAL_ORDERED >= 1267   ~ 6,  # Mean 
      TOTAL_ORDERED >= 815    ~ 5,  # 3rd quartile
      TOTAL_ORDERED >= 400    ~ 4,  # Threshold
      TOTAL_ORDERED >= 302    ~ 3,  # Median
      TOTAL_ORDERED >= 200    ~ 2,  
      TRUE ~ 1  
    )
  )



# Calculate the overall RFM Score as the sum of Recency, Frequency, Order Interval, and Volume scores
full_data_customer <- full_data_customer %>%
  mutate(
    RFM_SCORE = RECENCY_SCORE + ORDER_FREQUENCY_SCORE + ORDER_INTERVAL_SCORE + VOLUME_SCORE
  )


# Count the number of customers in each RFM_SCORE range
rfm_distribution <- full_data_customer %>%
  mutate(RFM_CATEGORY = case_when(
    RFM_SCORE <= 10 ~ "4-10",
    RFM_SCORE <= 20 ~ "11-20",
    RFM_SCORE <= 30 ~ "21-30",
    TRUE ~ "31-40"
  )) %>%
  group_by(RFM_CATEGORY) %>%
  summarise(CUSTOMER_COUNT = n(), .groups = "drop") %>%
  mutate(PERCENTAGE = CUSTOMER_COUNT / sum(CUSTOMER_COUNT) * 100)

# Reorder RFM_CATEGORY to ensure it starts with scores between 4 and 10
rfm_distribution$RFM_CATEGORY <- factor(rfm_distribution$RFM_CATEGORY, 
                                        levels = c("4-10", "11-20", "21-30", "31-40"))

# Plot the distribution of RFM scores
ggplot(rfm_distribution, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(PERCENTAGE, 1), "%")), vjust = -0.3, size = 4) +
  scale_fill_brewer(palette = "Set3") +  # Use Set3 color palette
  labs(title = "Distribution of Customers by RFM Score",
       x = "RFM Score Range",
       y = "Percentage of Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11)
  )

Code
# Remove unnecessary intermediate data frame
rm(rfm_distribution)

The adapted RFM score condenses several dimensions of store consumption into a single metric. About 60% of stores have a score of up to 20 (the median), 32% score between 21 and 30, and 8.5% score above 30, suggesting that only a small percentage of stores exhibit high consumption patterns.

Code
# Filter only customers where LOCAL_FOUNT_ONLY == 1
rfm_distribution_lfo <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  mutate(RFM_CATEGORY = case_when(
    RFM_SCORE <= 10 ~ "4-10",
    RFM_SCORE <= 20 ~ "11-20",
    RFM_SCORE <= 30 ~ "21-30",
    TRUE ~ "31-40"
  )) %>%
  group_by(RFM_CATEGORY) %>%
  summarise(CUSTOMER_COUNT = n(), .groups = "drop") %>%
  mutate(PERCENTAGE = CUSTOMER_COUNT / sum(CUSTOMER_COUNT) * 100)

# Reorder RFM_CATEGORY to ensure it starts with scores between 4 and 10
rfm_distribution_lfo$RFM_CATEGORY <- factor(rfm_distribution_lfo$RFM_CATEGORY, 
                                            levels = c("4-10", "11-20", "21-30", "31-40"))

# Plot the distribution of RFM scores for LOCAL_FOUNT_ONLY == 1
ggplot(rfm_distribution_lfo, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(PERCENTAGE, 1), "%")), vjust = -0.3, size = 4) +
  scale_fill_brewer(palette = "Set3") +  # Use Set3 color palette
  labs(title = "Distribution of Customers by RFM Score (LOCAL_FOUNT_ONLY = 1)",
       x = "RFM Score Range",
       y = "Percentage of Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11)
  )

Code
# Remove unnecessary intermediate data frame
rm(rfm_distribution_lfo)

For customers who are local market partners consuming only fountain drinks, consumption patterns are even lower: nearly 74% have scores up to 20, and fewer than 3.6% have scores above 30.

5.3 Customer Demand and Growth

5.3.1 Low Demand Customers

A small number of customers exhibit very high consumption volumes, which skews the average above the median. The table below explores metrics for customers whose demand falls below the first quartile.
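The skew can be illustrated with synthetic data: mixing many modest, lognormally distributed demands with a handful of very large accounts pushes the mean well above the median, which is why quartile-based cutoffs are more robust than the mean for segmenting low-demand customers. The figures below are simulated, not client data.

```r
# Synthetic demand: mostly modest customers plus a few very large accounts
set.seed(42)
demand <- c(rlnorm(995, meanlog = 5, sdlog = 1),  # typical customers
            runif(5, 2e5, 5e5))                   # a few very large accounts

# The heavy right tail drags the mean above the median
mean(demand) > median(demand)   # TRUE: distribution is right-skewed

# A first-quartile cutoff of the kind used for the low-demand segment
quantile(demand, 0.25)
```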

Code
# Summarize the metrics
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Avg_Vol_Cust = round(mean((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE)),
    Median_Vol_Cust = round(median((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE)),
    First_Quartile_Vol = round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), 0.25, na.rm = TRUE)),
    .groups = 'drop'
  )

# Calculate the first quartile for each channel
quartile_data <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    First_Quartile_Val = round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), 0.25, na.rm = TRUE)),
    Tot_Vol = sum((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE),
    .groups = 'drop'
  )

# Calculate the number of customers below the first quartile and their total volume
below_quartile_stats <- full_data_customer %>%
  left_join(quartile_data %>% dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Val), by = "COLD_DRINK_CHANNEL") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Num_Customers_Below_1Q = sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val, na.rm = TRUE),
    Vol_Below_1Q = sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024))[((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val], na.rm = TRUE),
    .groups = 'drop'
  )

# Count the total number of customers per channel
customer_count <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(N_Cust = n(), .groups = 'drop')

# Combine all the data
final_summary <- customer_count %>%
  left_join(data_summary, by = "COLD_DRINK_CHANNEL") %>%
  left_join(quartile_data, by = "COLD_DRINK_CHANNEL") %>%
  left_join(below_quartile_stats, by = "COLD_DRINK_CHANNEL") %>%
  mutate(
    Vol_Perct = round((Vol_Below_1Q / Tot_Vol) * 100, 1),
    First_Quartile_Vol = as.integer(First_Quartile_Val)
  ) %>%
  dplyr::select(COLD_DRINK_CHANNEL, N_Cust, Avg_Vol_Cust, Median_Vol_Cust, First_Quartile_Vol, 
                Num_Customers_Below_1Q, Vol_Perct)

# Display the table with kable and styling
kable(final_summary, format = "html", escape = FALSE, align = "c", 
     col.names = c("Cold Drink Channel", "Total Cust.", "Avg. Vol Cust.", "Median Vol Cust.", 
                    "1st Quartile Qtd", "Cust. Below 1st Quart", "Vol % Below 1st Quart")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightyellow") %>%
  add_header_above(c("Customers Analysis by Cold Drink Channel" = 7)) %>%
  kable_paper("striped", full_width = FALSE)
Customers Analysis by Cold Drink Channel

| Cold Drink Channel | Total Cust. | Avg. Vol Cust. | Median Vol Cust. | 1st Quartile Qtd | Cust. Below 1st Quart | Vol % Below 1st Quart |
|---|---:|---:|---:|---:|---:|---:|
| ACCOMMODATION | 1235 | 727 | 376 | 122 | 310 | 2.0 |
| BULK TRADE | 1320 | 7060 | 1420 | 444 | 330 | 0.7 |
| CONVENTIONAL | 57 | 190 | 99 | 39 | 15 | 3.1 |
| DINING | 15400 | 633 | 283 | 98 | 3860 | 1.8 |
| EVENT | 3074 | 1496 | 329 | 91 | 771 | 0.7 |
| GOODS | 5826 | 628 | 209 | 104 | 1465 | 2.1 |
| PUBLIC SECTOR | 1736 | 1085 | 283 | 94 | 435 | 1.1 |
| WELLNESS | 479 | 2413 | 625 | 182 | 119 | 0.9 |
| WORKPLACE | 1193 | 4046 | 200 | 87 | 303 | 0.3 |

For customers whose combined 2023 and 2024 consumption falls below the first quartile, the totals represent very small shares of each segment, ranging from 0.3% to 3.1%. In the dining segment, for example, the 3,860 customers below the first quartile (about 25% of the segment, by construction) account for only 1.8% of its volume.

Some of these customers were nonetheless flagged as having high growth potential, since their demand growth is above average. From a very low base, even a small absolute increase in demand translates into a large growth percentage.

Their low RFM scores also indicate poor recency, frequency, and total purchase volume. Therefore, a flag, LOW_DEMAND_CUST, will be created, where a value of 1 marks low-consumption customers. Customers carrying this flag will be assigned a white truck regardless of their growth indices.

Below are the cut volumes by segment:

Code
# Extract the list of 'Cold Drink Channel' and '1st Quartile Qty'
list_summary <- final_summary %>%
  dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Vol) %>%
  deframe()

# Display the list
list_summary
ACCOMMODATION    BULK TRADE  CONVENTIONAL        DINING         EVENT 
          122           444            39            98            91 
        GOODS PUBLIC SECTOR      WELLNESS     WORKPLACE 
          104            94           182            87 
Code
# Calculate the sum per row and assign LOW_DEMAND_CUST
full_data_customer <- full_data_customer %>%
  mutate(
    Total_Vol_Cust = (QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    LOW_DEMAND_CUST = if_else(Total_Vol_Cust <= list_summary[COLD_DRINK_CHANNEL], 1, 0)
  )
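
The threshold comparison above works because indexing a named vector by a character column performs a per-row lookup. A minimal sketch of the same pattern on toy data (channel names and values are illustrative, not from the client data):

```r
library(dplyr)

# Like list_summary: first-quartile volume threshold per channel
thresholds <- c(DINING = 98, GOODS = 104)

toy <- data.frame(
  COLD_DRINK_CHANNEL = c("DINING", "GOODS", "DINING"),
  Total_Vol_Cust = c(50, 500, 200)
)

# Each row is compared against the threshold for its own channel
toy <- toy %>%
  mutate(LOW_DEMAND_CUST = if_else(Total_Vol_Cust <= thresholds[COLD_DRINK_CHANNEL], 1, 0))

toy$LOW_DEMAND_CUST  # 1 0 0
```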

In the plot below, the numbers represent the percentages and the number of customers who received this flag.

Code
# Group and calculate the number of customers with LOW_DEMAND_CUST by LOCAL_FOUNT_ONLY
summary_low_demand <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, LOW_DEMAND_CUST) %>%
  summarise(
    total_customers = n(),
    .groups = "drop"
  )

# Calculate the percentage for each group
summary_low_demand <- summary_low_demand %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    percentage = total_customers / sum(total_customers) * 100
  )

# Plot for percentages with LOW_DEMAND_CUST as fill and LOCAL_FOUNT_ONLY as groups
ggplot(summary_low_demand, aes(x = factor(LOCAL_FOUNT_ONLY), y = percentage, fill = factor(LOW_DEMAND_CUST))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(scales::comma(percentage, suffix = "%"), " (", total_customers, ")")),
            position = position_dodge(width = 0.8), vjust = -0.2, size = 3.5) +
  labs(title = "Percentage of Customers with Low Demand") +
  scale_fill_manual(values = c("0" = "darkolivegreen", "1" = "sandybrown"), 
                    labels = c("0" = "Others (Above Q1)", "1" = "Low Demand")) +  # Set colors and labels for LOW_DEMAND_CUST
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_text(face = "bold", size = 10),  # Add legend title
    legend.position = "right",  # Position legend on the right side
    legend.box = "vertical",  # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10, angle = 0),  # Display x-axis labels without rotation
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +  # Set x-axis labels
  guides(fill = guide_legend(title = "Low Demand Status"))  # Add legend title

5.3.2 Demand Variation Across All Stores

To measure demand growth patterns across our customer base (January 2023 - December 2024):

  1. Data Preparation: Combined monthly case and gallon deliveries for each customer into total monthly volumes.

  2. Eligibility: Required ≥6 months with positive orders for reliable analysis. Customers with <6 ordering months were classified as having no growth potential (6,026 customers).

  3. Growth Calculation:

    • Split each qualifying customer’s order history into two equal time periods
    • For odd numbers of months, divided the middle month equally between periods
    • Calculated growth rate as: (Second Period Total - First Period Total) / First Period Total
  4. Classification: Customers with growth rates exceeding the average positive growth rate were categorized as high growth potential (HIGH_GROW_POT = 1), while all others received a standard classification (HIGH_GROW_POT = 0).
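
The split-and-compare rule in steps 2 and 3 can be condensed into a small standalone function. This is a sketch of the rule for a single customer's monthly volumes, not the production loop used below:

```r
# Half-split growth rule: keep months with positive orders, split them into
# two halves (sharing the middle month when odd), then compute
# (second half - first half) / first half
growth_rate <- function(monthly_vols) {
  v <- monthly_vols[monthly_vols > 0]
  n <- length(v)
  if (n < 6) return(NA_real_)        # eligibility: at least 6 ordering months
  if (n %% 2 == 0) {
    first  <- sum(v[1:(n / 2)])
    second <- sum(v[(n / 2 + 1):n])
  } else {
    m <- (n + 1) %/% 2               # middle month, divided equally
    first  <- sum(v[seq_len(m - 1)]) + v[m] / 2
    second <- v[m] / 2 + sum(v[(m + 1):n])
  }
  if (first == 0) return(NA_real_)   # avoid division by zero
  (second - first) / first
}

growth_rate(c(10, 10, 10, 20, 20, 20))      # even case: (60 - 30) / 30 = 1
growth_rate(c(10, 10, 10, 10, 10, 10, 40))  # odd case: middle month shared
```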

Code
# Initialize new columns in the dataset
full_data_customer$NUM_POSITIVE_SUMS <- 0
full_data_customer$QTD_DLV_FIRST_HALF <- 0
full_data_customer$QTD_DLV_SECOND_HALF <- 0
full_data_customer$DEMAND_VARIATION <- NA  # Initialize as NA

# Process each customer individually
for (i in 1:nrow(full_data_customer)) {
  # Create a vector of positive sums while maintaining the chronological order
  POSITIVE_SUMS <- c()
  
  # Iterate over the 24 months in the correct sequence
  for (j in 1:24) {
    # Create column names
    year <- 2023 + (j - 1) %/% 12
    month <- (j - 1) %% 12 + 1
    
    CA_COL <- paste0("QTD_DLV_CA_", sprintf("%04d", year), "_", sprintf("%02d", month))
    GAL_COL <- paste0("QTD_DLV_GAL_", sprintf("%04d", year), "_", sprintf("%02d", month))
    
    # Check if columns exist in the dataset
    if (CA_COL %in% names(full_data_customer) && GAL_COL %in% names(full_data_customer)) {
      CA_VALUE <- full_data_customer[[CA_COL]][i]
      GAL_VALUE <- full_data_customer[[GAL_COL]][i]
      
      # Replace NA with 0
      CA_VALUE <- ifelse(is.na(CA_VALUE), 0, CA_VALUE)
      GAL_VALUE <- ifelse(is.na(GAL_VALUE), 0, GAL_VALUE)
      
      # Sum values for the month
      SUM_VALUE <- CA_VALUE + GAL_VALUE
      
      # Add to the list if positive
      if (SUM_VALUE > 0) {
        POSITIVE_SUMS <- c(POSITIVE_SUMS, SUM_VALUE)
      }
    }
  }
  
  # Total number of positive operations
  num_operations <- length(POSITIVE_SUMS)
  full_data_customer$NUM_POSITIVE_SUMS[i] <- num_operations
  
  # If fewer than 6 positive sums, set values accordingly and continue
  if (num_operations < 6) {
    full_data_customer$QTD_DLV_FIRST_HALF[i] <- 0
    full_data_customer$QTD_DLV_SECOND_HALF[i] <- 0
    full_data_customer$DEMAND_VARIATION[i] <- NA
    next
  }
  
  # Initialize the two halves
  QTD_DLV_FIRST_HALF <- 0
  QTD_DLV_SECOND_HALF <- 0
  
  # Split the operations into two halves
  if (num_operations %% 2 == 0) {
    # If even number of operations
    mid_point <- num_operations / 2
    QTD_DLV_FIRST_HALF <- sum(POSITIVE_SUMS[1:mid_point])
    QTD_DLV_SECOND_HALF <- sum(POSITIVE_SUMS[(mid_point + 1):num_operations])
  } else {
    # If odd number of operations
    mid_point <- (num_operations + 1) %/% 2
    
    # Split the middle value between both halves
    first_part <- if(mid_point > 1) POSITIVE_SUMS[1:(mid_point - 1)] else numeric(0)
    central_value <- POSITIVE_SUMS[mid_point] / 2
    second_part <- if(mid_point < num_operations) POSITIVE_SUMS[(mid_point + 1):num_operations] else numeric(0)
    
    QTD_DLV_FIRST_HALF <- sum(c(first_part, central_value))
    QTD_DLV_SECOND_HALF <- sum(c(central_value, second_part))
  }
  
  # Assign values to the dataset
  full_data_customer$QTD_DLV_FIRST_HALF[i] <- QTD_DLV_FIRST_HALF
  full_data_customer$QTD_DLV_SECOND_HALF[i] <- QTD_DLV_SECOND_HALF
  
  # Calculate demand variation
  if (QTD_DLV_FIRST_HALF > 0) {  # Avoid division by zero
    DEMAND_VARIATION_VALUE <- (QTD_DLV_SECOND_HALF - QTD_DLV_FIRST_HALF) / QTD_DLV_FIRST_HALF
    full_data_customer$DEMAND_VARIATION[i] <- DEMAND_VARIATION_VALUE
  } else {
    full_data_customer$DEMAND_VARIATION[i] <- NA
  }
}

# Create the HIGH_GROW_POT column
full_data_customer$HIGH_GROW_POT <- 0  # Initialize all values to 0

# Calculate the mean of DEMAND_VARIATION for positive values only
positive_variations <- full_data_customer$DEMAND_VARIATION[full_data_customer$DEMAND_VARIATION > 0]
if (length(positive_variations) > 0) {
  mean_value <- mean(positive_variations, na.rm = TRUE)
  
  # Display the calculated mean
  cat("Calculated mean of positive DEMAND_VARIATION: ", mean_value, "\n")
  
  # Assign 1 for customers with DEMAND_VARIATION greater than the mean
  full_data_customer$HIGH_GROW_POT <- ifelse(!is.na(full_data_customer$DEMAND_VARIATION) & 
                                            full_data_customer$DEMAND_VARIATION > mean_value, 
                                            1, 
                                            full_data_customer$HIGH_GROW_POT)
} else {
  cat("No positive DEMAND_VARIATION values found\n")
}
Calculated mean of positive DEMAND_VARIATION:  0.2843618 

Considering all customers, there was an average demand growth variation of 28%. However, 6,026 customers were excluded from the analysis as their growth could not be calculated due to having fewer than 6 periods of orders. For these customers, it was assumed that they have no growth potential.

Below is the number of customers whose growth exceeded the overall average, regardless of segment.

Code
# Group and calculate the percentage of customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),
    total_customers = n(),
    .groups = "drop"
  ) %>%
  mutate(
    pct_high_growth = high_growth / total_customers * 100,
    pct_low_growth = low_growth / total_customers * 100
  )

# Transform data into long format for percentages
summary_high_growth_long <- summary_high_growth %>%
  pivot_longer(
    cols = starts_with("pct_"),
    names_to = "growth_type",
    values_to = "percentage"
  ) %>%
  mutate(
    growth_type = factor(growth_type, 
                         levels = c("pct_low_growth", "pct_high_growth"),  
                         labels = c("Low Growth Potential", "High Growth Potential"))
  )

# Ensure LFO is a factor
summary_high_growth_long$LOCAL_FOUNT_ONLY <- factor(summary_high_growth_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))

# Plot for percentages with the legend on the side
ggplot(summary_high_growth_long, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = growth_type)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")), 
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers Classified as Low or High Growth Potential") +
  scale_fill_manual(values = c("Low Growth Potential" = "#FF6347", "High Growth Potential" = "#40E0D0")) +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),  # Remove legend title
    legend.position = "right",  # Position legend on the right side
    legend.box = "vertical",  # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Growth Potential"))  # Add a legend title

Code
# Group and calculate the number of customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),  
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    .groups = "drop"
  )

# Display the summary with the count of customers
#summary_high_growth

Approximately 9% of customers (123) identified as local market partners who purchase fountain-only products show high growth potential according to the established criteria. Among other customers, about 12% (3,450) are classified as having high growth potential.

Customers with high volumes are somewhat penalized by this criterion, as significant demand growth is more difficult to achieve. However, their substantial volume already places them as strategic partners, making them essential for close monitoring and prioritized deliveries via red trucks. For these customers, lower distribution costs allow for more competitive pricing, supporting the long-term sustainability of the partnership.

5.3.3 Demand Variation by Cold Drink Channel

Each customer’s growth potential was considered within their respective segment. Following the same criteria as before, only customers whose demand variation exceeded the segment average were classified as high potential. Below is the calculated demand variation for each Cold Drink Channel during the period.

Code
# Define the custom color palette for COLD_DRINK_CHANNEL with unique colors
cold_drink_channel_colors <- c("DINING" = "#A7ADC6", "PUBLIC SECTOR" = "#FF6347", "EVENT" = "#B33951", "WORKPLACE" = "#ABD2FA", "ACCOMMODATION" = "#E377C2", "GOODS" = "#FFD700", "BULK TRADE" = "#8ED081", "WELLNESS" = "#20B2AA", "CONVENTIONAL" = "#1F77B4")

# Aggregate data: mean DEMAND_VARIATION by channel
summary_growth_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(CHANNEL_VAR = mean(DEMAND_VARIATION, na.rm = TRUE))

# Create horizontal bar chart
ggplot(summary_growth_channel, aes(x = CHANNEL_VAR, y = reorder(COLD_DRINK_CHANNEL, CHANNEL_VAR), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(CHANNEL_VAR * 100, 1), "%")), 
            hjust = -0.01, 
            color = "black", size = 3.2) +
  labs(title = "Average Demand Variation by Cold Drink Channel",
       x = "Percentage Variation (%)", 
       y = NULL) +  
  scale_x_continuous(labels = scales::label_percent(accuracy = 0.1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +
  theme_minimal() +  
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),  
    axis.title.x = element_text(size = 10),  
    legend.position = "none",
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", linewidth = 0.5)  # 'size' is deprecated for lines
  )

Dining and bulk trade are the most important channels, with customers increasing their demand by 2.1% and 5.6%, respectively, on average.

Wellness experienced the highest variation at almost 10%, but it accounts for only 3.2% of the total volume sold. Goods had the second-highest variation, at 9%, and represents 10% of the total volume.

Code
# Calculate the mean DEMAND_VARIATION for each COLD_DRINK_CHANNEL
channel_means <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(MEAN_DEMAND_VARIATION = mean(DEMAND_VARIATION, na.rm = TRUE))

# Merge the mean values with the full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(channel_means, by = "COLD_DRINK_CHANNEL")

# Create the CHANNEL_GROWTH_POT column
full_data_customer$CHANNEL_GROWTH_POT <- ifelse(
  is.na(full_data_customer$DEMAND_VARIATION), 0,
  ifelse(full_data_customer$DEMAND_VARIATION > full_data_customer$MEAN_DEMAND_VARIATION, 1, 0)
)

# Remove the MEAN_DEMAND_VARIATION column
full_data_customer <- full_data_customer %>% dplyr::select(-MEAN_DEMAND_VARIATION)

# Calculate the percentage of customers with high growth potential by channel
summary_growth_channel_customers <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(pct_high_growth = mean(CHANNEL_GROWTH_POT == 1, na.rm = TRUE) * 100)

# Create the horizontal bar chart
ggplot(summary_growth_channel_customers, aes(x = pct_high_growth, y = reorder(COLD_DRINK_CHANNEL, pct_high_growth), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_high_growth, 1), "%")), 
            hjust = -0.01, 
            color = "black", size = 3.2) +
  labs(title = "Percentage of Customers with High Growth Potential by Cold Drink Channel",
       x = "Percentage of Customers (%)", 
       y = NULL) +  
  scale_x_continuous(labels = scales::label_number(accuracy = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Now correctly using the defined palette
  theme_minimal() +  
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),  
    axis.title.x = element_text(size = 10),  
    legend.position = "none",  
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", linewidth = 0.5)  # 'size' is deprecated for lines
  )

Most segments showed more than 30% of customers with growth above their group average; only the EVENT segment was lower, at roughly 23%. These customers will be classified as high-growth within their respective segments.

Classifying every customer whose variation exceeds the average for their cold drink channel significantly expands the high-potential customer base. Even in a simulation that tightened the cutoff to growth of 100% above the average, the base remained large. This criterion will therefore need further analysis before potentially being adopted.

6. Correlations

Customer Features X RFM_SCORE

To understand how the variables correlate, and to surface clear signals without multicollinearity, we selected the numeric variables and display only the stronger correlations, discarding those between -0.2 and 0.2.
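
The same "keep only |r| > 0.2, drop duplicates and self-correlations" filtering can be sketched compactly, shown here on the built-in mtcars data only because the client data cannot be reproduced:

```r
# Blank out the diagonal and lower triangle so each pair survives exactly once,
# then keep only correlations with absolute value above 0.2
m <- cor(mtcars[, c("mpg", "wt", "qsec", "gear")])
m[lower.tri(m, diag = TRUE)] <- NA

idx <- which(abs(m) > 0.2, arr.ind = TRUE)
pairs <- data.frame(
  Variable1   = rownames(m)[idx[, "row"]],
  Variable2   = colnames(m)[idx[, "col"]],
  Correlation = round(m[idx], 2)
)
pairs[order(-pairs$Correlation), ]
```

This achieves the same deduplication as the `pmin()`/`pmax()` pair-id trick in the chunk below, with the upper triangle doing the work instead.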

Code
# List of selected variables
selected_vars <- c(
  "LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER", 
  "DAYS_ONBOARDING", "DAYS_FIRST_DLV", 
  "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI", 
   "RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Select only the numeric variables from the dataset
numeric_vars <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Compute the correlation matrix (handling missing values)
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs in correlation matrix with 0 to avoid errors
cor_matrix[is.na(cor_matrix)] <- 0

# Set self-correlations (value 1) to NA so they are excluded from the table
cor_matrix[cor_matrix == 1] <- NA  # note: a genuinely perfect pair would also be dropped

# Convert correlation matrix to long format
cor_df <- as.data.frame(cor_matrix) %>%
  rownames_to_column(var = "Variable1") %>%
  pivot_longer(cols = -Variable1, names_to = "Variable2", values_to = "Correlation") %>%
  filter(!is.na(Correlation)) %>%  # Remove NAs which represent correlations of 1
  filter((Correlation >= 0.20 & Correlation <= 0.99) | (Correlation <= -0.20 & Correlation >= -0.99)) %>%  # Keep only correlations outside of the -0.20 to 0.20 range
  mutate(Correlation = round(Correlation, 2)) %>%  # Round correlations to 2 decimal places
  mutate(pair_id = paste0(pmin(Variable1, Variable2), "-", pmax(Variable1, Variable2))) %>%
  distinct(pair_id, .keep_all = TRUE) %>%  # Remove duplicate pairs (A-B, B-A)
  dplyr::select(Variable1, Variable2, Correlation) %>%
  arrange(desc(Correlation))  # Sort by correlation value from highest to lowest

# Display the correlation matrix in kable format with styling
cor_df %>%
  kable("html", col.names = c("Variable 1", "Variable 2", "Correlation")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:3, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%
  add_header_above(c("Correlations Between Selected Variables (|r| > 0.2)" = 3)) %>%
  kable_paper("striped", full_width = FALSE)
Code
# Compute the correlation matrix for the selected numeric variables
correlation_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs with 0 to avoid errors
correlation_matrix[is.na(correlation_matrix)] <- 0

# Visualize the correlation matrix with rotated labels
corrplot(correlation_matrix, method = "circle", type = "upper", 
         tl.cex = 0.8, tl.col = "black", tl.srt = 45, number.cex = 0.6, 
         diag = FALSE,  # Remove diagonal
         col = colorRampPalette(c("blue", "white", "red"))(200)) # Color palette

The strongest correlations were observed between Days Onboarding and Days After First Delivery (0.7) and between the order types MyCoke Legacy and MyCoke 360 (0.66). Both relationships make sense: customers who onboarded earlier tend to have older orders, except for cases where a new store belongs to an established chain. Similarly, customers who previously used the legacy channel transitioned to the newer 360 platform.

There is a correlation of 0.53 between overall customer growth and growth within the Cold Drink Channel, suggesting that expansion trends align across segments. The RFM Score also correlates with various variables that were not directly considered in its calculation, with correlations ranging from 0.44 to 0.27.

Among the negative correlations, the most notable is between RFM Score and Low Demand Customer (-0.65), indicating that lower RFM scores effectively capture low-demand customers.

Census X Total Ordered

The 2023 updated census variables all showed very low correlations, close to zero, with customers' consumption patterns.

For this reason, these variables will be excluded, along with others no longer required, to streamline full_data_customer. However, the process will be retained in this document, as the company may obtain different results when applying real locations.

7. Customer Segmentation

Since all customers in the original dataset are served by red trucks, there is no prior information on the characteristics of those who would be served by white trucks. The only available reference is the average annual consumption threshold of 400 gallons or cases.

To address this, customers were segmented based on their most relevant characteristics within the available scope, including variables created during the analysis.

Variables selected represent store-level traits or consumption behavior, with geographic and census data excluded.

The variables selected are listed below:

Customer Type & Relationship:
These variables represent customers’ relationship with the company and their type:
- LOCAL_FOUNT_ONLY: Customers who only consume fountain drinks.
- LOCAL_MARKET_PARTNER: Local market partners.
- CO2_CUSTOMER: Customers who are CO2 consumers.
- CHAIN_MEMBER: Customers who are part of a chain.

Time-Related Metrics:
Time-related metrics track customers’ activity and engagement over time:
- DAYS_ONBOARDING: Number of days since onboarding.
- DAYS_FIRST_DLV: Number of days since the first delivery.
- DAYS_AF_LAST_ORD: Number of days after the last order.
- AVG_DAYS_BET_ORD: Average number of days between orders.

Order & Sales Behavior:
These variables represent customer behaviors in terms of orders and sales:
- NUM_ORDERS: Total number of orders.
- TOTAL_ORDERED: Total volume of orders.
- RFM_SCORE: Adapted Recency, Frequency, Monetary score.
- TOTAL_COST_CA_GAL: Total cost in deliveries for 2023 and 2024.

Order Channels:
This category contains data on the various channels through which customers make their transactions:
- OT_CALL.CENTER: Transactions via call center.
- OT_OTHER: Transactions made through other means (emails, trade fairs, etc.).
- OT_SALES.REP: Transactions via sales representatives.
- OT_MYCOKE.LEGACY: Transactions via the legacy MyCoke platform.
- OT_MYCOKE360: Transactions via the newer MyCoke 360 platform.
- OT_EDI: Transactions via electronic direct ordering (EDI).

Growth & Demand Potential:
These flags indicate customers’ growth and demand potential:
- HIGH_GROW_POT: Flag for customers with above-average growth potential across all segments.
- CHANNEL_GROWTH_POT: Flag for customers with above-average growth within their segment.
- LOW_DEMAND_CUST: Flag for customers with low demand (below the 1st quartile) by segment.

Several variables have wide value ranges with extreme outliers. To reduce their skew before standardization, a log1p transformation will be applied to DAYS_ONBOARDING, DAYS_FIRST_DLV, TOTAL_ORDERED, and TOTAL_COST_CA_GAL.
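
As a quick illustration of the transformation's effect (toy values, not client volumes; log1p rather than log is used so zeros map to zero):

```r
# log1p compresses an extreme right tail while keeping zeros at zero
x <- c(0, 5, 10, 50, 5000)
range(x)         # 0 5000
range(log1p(x))  # 0 and roughly 8.52
```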

Code
# Select the primary variables for clustering based on business relevance
selected_vars <- c(
  "LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER", 
  "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD", 
  "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI", 
  "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the selected numeric variables
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log1p to the heavily skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Determine optimal number of clusters using the Elbow Method
set.seed(500)  
wss <- sapply(1:10, function(k) kmeans(data_to_cluster, centers = k, nstart = 25)$tot.withinss)

# Visualize the Elbow Method results
plot(1:10, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters", ylab = "Total Within Sum of Squares (WSS)", 
     main = "Elbow Method for Optimal K")

After testing different configurations for the silhouette score and ARI (varying the number of clusters from 2 to 4 and trying the Hartigan-Wong, MacQueen, and Lloyd update algorithms; note that base R's kmeans() always uses Euclidean distance, so these vary the update rule rather than the distance metric), the most relevant metrics are presented below.

Code
# Set seed for reproducibility
set.seed(500)

# Function to calculate the Silhouette Score
calculate_silhouette_score <- function(model, data) {
  clusters <- model$cluster  
  if (length(clusters) != nrow(data)) {
    stop("Cluster assignments do not match the number of data points.")
  }
  dist_matrix <- dist(data)
  silhouette_score <- silhouette(clusters, dist_matrix)
  return(mean(silhouette_score[, 3]))  
}

# Function to calculate Adjusted Rand Index
calculate_ari <- function(model, true_labels) {
  clusters <- model$cluster
  ari_score <- adjustedRandIndex(clusters, true_labels)
  return(ari_score)
}

# Select the primary variables for clustering based on business relevance
selected_vars <- c(
  "LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER", 
  "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD", 
  "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI", 
  "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the selected numeric variables
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log1p to the heavily skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Define different parameter configurations for K-means
params <- list(
  list(name = "Euclidean, 2 Clusters", centers = 2, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 3 Clusters", centers = 3, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 4 Clusters", centers = 4, nstart = 25, algorithm = "Hartigan-Wong")
#  ,
#  list(name = "MacQueen, 2 Clusters", centers = 2, nstart = 25, algorithm = "MacQueen"),
#  list(name = "MacQueen, 3 Clusters", centers = 3, nstart = 25, algorithm = "MacQueen"),
#  list(name = "MacQueen, 4 Clusters", centers = 4, nstart = 25, algorithm = "MacQueen"),
  
#  list(name = "Lloyd, 2 Clusters", centers = 2, nstart = 25, algorithm = "Lloyd"),
#  list(name = "Lloyd, 3 Clusters", centers = 3, nstart = 25, algorithm = "Lloyd"),
#  list(name = "Lloyd, 4 Clusters", centers = 4, nstart = 25, algorithm = "Lloyd")
)

# Apply K-means clustering and store results
results <- lapply(params, function(param) {
  model <- kmeans(data_to_cluster, centers = param$centers, nstart = param$nstart, algorithm = param$algorithm)
  silhouette <- calculate_silhouette_score(model, data_to_cluster)
  ari <- calculate_ari(model, full_data_customer$LOCAL_MARKET_PARTNER)  # LOCAL_MARKET_PARTNER serves as a proxy label; replace with a true label column if one is available
  return(data.frame(Model = param$name, Silhouette_Score = round(silhouette, 3), ARI = round(ari, 3)))
})

# Combine results into a single table
results_df <- do.call(rbind, results)

# Display table using kable
kable(results_df, col.names = c("Parameter", "Silhouette Score", "Adjusted Rand Index (ARI)"))
Parameter Silhouette Score Adjusted Rand Index (ARI)
Euclidean, 2 Clusters 0.210 -0.054
Euclidean, 3 Clusters 0.180 0.043
Euclidean, 4 Clusters 0.176 0.048

Given the results, the combination “Euclidean, 3 Clusters” was selected, using centers = 3, nstart = 25, and the default “Hartigan-Wong” algorithm, as it offered the best trade-off between silhouette score and ARI among the tested options. Even so, a silhouette score of 0.180 indicates that the separation between clusters is weak.
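As a complementary check on the number of clusters, the total within-cluster sum of squares can be compared across values of k; the "elbow" where the decrease flattens typically agrees with the silhouette comparison. A minimal base-R sketch on simulated data (the customer dataset itself is not reproduced here):

```r
# Sketch: elbow check for k using base R only (toy data, not the customer dataset)
set.seed(500)

# Simulate three loosely separated groups to mimic weakly separated segments
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 2), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2)
)
toy <- scale(toy)  # standardize, as in the clustering above

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The drop should flatten after the true number of groups
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS", main = "Elbow check (toy data)")
```

When the curve has no sharp elbow, as tends to happen with weakly separated data like this, silhouette and ARI comparisons become the more informative criteria.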

Below is the visualization of the clusters based on the two principal components.

Code
# Implement K-means with optimal number of clusters 
set.seed(500)
kmeans_optimal <- kmeans(data_to_cluster, centers = 3, nstart = 25, algorithm = "Hartigan-Wong")

# Add cluster assignments to the original dataset
full_data_customer$CLUSTER <- kmeans_optimal$cluster

# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347",  # Tomato red
  "2" = "#4682B4",  # Steel blue
  "3" = "#FFD700")  # Gold

# Visualize cluster distribution with PCA-reduced dimensions
fviz_cluster(kmeans_optimal, data = data_to_cluster, geom = "point", 
             ellipse.type = "none", 
             main = "Customer Segmentation: PCA-based Visualization",
             subtitle = "K-means Optimal Clustering with 3 Segments",
             ggtheme = theme_minimal()) +
  scale_color_manual(values = palette_clusters) # Manually set colors

The customer segmentation will be discussed later, including the interpretation of each cluster.

7.1 Clusters and Principal Components

Given the visualization of the clusters through their principal components, the first two components were explored further, as together they account for 39% of the total variability.

Code
# Select the desired variables for clustering - adjusted to match your clustering selection
selected_vars <- c(
  "LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
  "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD", 
  "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI", 
  "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Select only the desired variables from full_data_customer
customer_data <- full_data_customer %>%
  dplyr::select(all_of(selected_vars))

# Remove rows with NA values (if any)
customer_data <- na.omit(customer_data)

# Apply log transformation to the same variables as in clustering
customer_data$DAYS_ONBOARDING <- log1p(customer_data$DAYS_ONBOARDING)
customer_data$DAYS_FIRST_DLV <- log1p(customer_data$DAYS_FIRST_DLV)
customer_data$TOTAL_ORDERED <- log1p(customer_data$TOTAL_ORDERED)
customer_data$TOTAL_COST_CA_GAL <- log1p(customer_data$TOTAL_COST_CA_GAL)

# Build scales for the dataset
scales <- build_scales(customer_data, verbose = FALSE)

# Scaling columns
customer_data <- fast_scale(customer_data, scales = scales, verbose = FALSE)

# Calculating the covariance matrix
cov_customer <- cov(customer_data)

# Calculating the Eigenvector and Eigenvalues of the variance-covariance matrix
e_customer <- eigen(cov_customer)
eigenvalues_customer <- e_customer$values
eigenvectors_customer <- e_customer$vectors

# Print
#print(paste("Counts the number of eigenvalues:", length(eigenvalues_customer)))

# Store the loadings (contributions) of each variable to all principal components,
# rounded to 2 decimal places
contributions_matrix <- round(eigenvectors_customer, 2)

# Convert the matrix to a data frame and assign appropriate row and column names
contributions_df <- as.data.frame(contributions_matrix)
colnames(contributions_df) <- paste0("PC", 1:ncol(contributions_matrix))  # Name the columns dynamically
rownames(contributions_df) <- colnames(customer_data)  # Assign the variable names as row names

# Variance Explained
# Calculate the variance explained by each principal component
variance_explained <- eigenvalues_customer / sum(eigenvalues_customer)

# Round the variance explained to 2 decimal places
variance_row <- round(variance_explained, 2)

# Calculate the cumulative variance explained
cumulative_variance <- cumsum(variance_explained)

# Round the cumulative variance to 2 decimal places
cumulative_variance_row <- round(cumulative_variance, 2)

# Add the variance and cumulative variance rows to the bottom of the data frame
contributions_df <- rbind(contributions_df, 
                        Variance_Explained = variance_row,
                        Cumulative_Variance = cumulative_variance_row)

# Format the table using formattable for heatmap effect
formattable(contributions_df, 
          list(
            # Apply color gradient to all columns
            area(col = 1:ncol(contributions_df)) ~ color_tile("white", "deepskyblue3")
          ))
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21
LOCAL_FOUNT_ONLY -0.05 0.00 0.00 0.03 -0.18 0.29 0.73 -0.24 0.00 0.40 -0.14 -0.16 -0.28 0.01 -0.01 0.01 0.04 -0.04 0.03 0.04 0.00
LOCAL_MARKET_PARTNER -0.12 0.02 0.22 -0.14 -0.13 0.35 0.18 0.41 0.29 -0.52 0.24 0.13 -0.38 -0.03 0.02 -0.06 0.02 0.04 0.01 -0.02 0.00
CO2_CUSTOMER 0.00 -0.02 0.24 -0.36 0.52 -0.10 -0.25 0.08 0.10 0.37 -0.04 -0.14 -0.54 -0.02 -0.04 0.03 0.06 -0.05 0.05 0.05 0.00
CHAIN_MEMBER 0.14 0.00 -0.30 0.30 -0.36 -0.18 -0.25 -0.11 -0.23 -0.17 -0.07 -0.20 -0.65 -0.10 -0.01 -0.12 0.00 0.01 -0.01 -0.03 0.00
DAYS_ONBOARDING 0.20 -0.43 -0.17 0.25 0.26 0.00 0.11 0.10 0.19 -0.04 0.12 -0.24 0.03 0.03 0.07 0.00 -0.02 0.42 0.54 0.00 0.00
DAYS_FIRST_DLV 0.22 -0.43 -0.14 0.26 0.26 0.01 0.11 0.10 0.21 -0.07 0.06 -0.14 0.00 0.02 -0.01 0.01 0.01 -0.39 -0.61 0.03 0.00
DAYS_AF_LAST_ORD -0.22 -0.10 -0.23 0.22 0.19 -0.17 0.11 0.04 0.06 0.13 -0.11 0.78 -0.18 -0.04 -0.05 -0.22 0.08 0.11 0.00 0.14 0.00
AVG_DAYS_BET_ORD -0.32 0.07 -0.12 0.04 0.02 -0.12 0.02 0.00 -0.03 0.08 0.29 -0.06 -0.07 -0.03 0.74 0.29 -0.21 0.10 -0.15 0.21 0.00
OT_CALL.CENTER 0.12 -0.34 0.00 -0.11 -0.04 0.48 -0.22 -0.20 -0.37 0.15 0.39 0.24 -0.03 0.02 0.06 -0.09 -0.09 -0.10 0.06 -0.07 -0.37
OT_OTHER 0.04 -0.01 -0.05 -0.06 0.05 -0.19 0.27 0.67 -0.62 0.09 0.03 -0.08 0.04 -0.04 -0.07 -0.03 -0.05 -0.03 0.01 0.01 -0.09
OT_SALES.REP 0.14 -0.03 -0.10 -0.42 -0.01 -0.52 0.31 -0.26 0.12 -0.24 0.15 0.02 -0.03 0.00 0.02 -0.14 -0.11 -0.10 0.06 -0.05 -0.46
OT_MYCOKE.LEGACY 0.19 0.42 0.03 0.39 0.20 0.06 0.04 0.02 0.07 -0.01 0.08 0.12 -0.09 -0.04 -0.22 0.53 -0.21 -0.06 0.07 -0.06 -0.41
OT_MYCOKE360 0.16 0.44 0.16 0.30 0.21 0.05 0.02 0.01 0.04 0.08 0.18 -0.12 0.06 0.05 0.26 -0.68 0.06 -0.06 0.01 -0.02 -0.10
OT_EDI 0.07 0.00 -0.19 -0.05 -0.47 -0.05 -0.19 0.39 0.48 0.49 0.08 0.00 0.07 -0.03 -0.03 -0.07 -0.09 -0.09 0.04 0.00 -0.22
NUM_ORDERS 0.34 0.12 -0.10 -0.09 -0.01 -0.10 0.10 -0.06 0.01 0.10 0.44 0.20 -0.06 -0.02 -0.08 0.04 -0.29 -0.20 0.13 -0.12 0.65
TOTAL_ORDERED 0.37 0.07 -0.03 -0.11 -0.02 -0.01 0.05 0.05 0.00 0.09 -0.05 0.16 -0.04 0.04 0.28 0.15 0.36 0.39 -0.28 -0.59 0.00
RFM_SCORE 0.37 0.11 0.02 -0.14 -0.07 0.04 0.00 -0.05 -0.02 0.02 0.13 0.01 0.01 0.01 -0.19 -0.02 0.02 0.49 -0.32 0.65 0.00
HIGH_GROW_POT 0.02 -0.20 0.53 0.24 -0.22 -0.29 0.01 0.00 -0.04 0.07 0.08 0.09 -0.09 0.67 0.00 0.06 -0.04 0.01 0.00 0.01 0.00
CHANNEL_GROWTH_POT 0.09 -0.21 0.55 0.19 -0.15 -0.20 0.03 -0.05 0.00 0.07 0.03 0.07 0.04 -0.72 0.07 0.01 -0.01 0.02 0.00 0.00 0.00
LOW_DEMAND_CUST -0.30 0.05 -0.08 0.10 -0.02 -0.15 0.01 -0.07 -0.01 0.07 0.58 -0.13 0.03 -0.05 -0.28 0.10 0.64 -0.02 0.00 -0.01 0.00
TOTAL_COST_CA_GAL 0.36 0.03 -0.02 -0.04 -0.06 0.03 0.00 0.06 -0.02 -0.08 -0.17 0.15 0.02 0.04 0.33 0.19 0.49 -0.42 0.33 0.36 0.01
Variance_Explained 0.30 0.09 0.07 0.07 0.06 0.06 0.05 0.05 0.05 0.04 0.03 0.03 0.02 0.02 0.02 0.02 0.01 0.00 0.00 0.00 0.00
Cumulative_Variance 0.30 0.39 0.46 0.54 0.60 0.66 0.71 0.76 0.80 0.84 0.87 0.90 0.93 0.95 0.96 0.98 0.99 0.99 1.00 1.00 1.00

Principal Component 1 explains 30% of the variance and is driven mainly by RFM_SCORE, NUM_ORDERS, TOTAL_ORDERED, and TOTAL_COST_CA_GAL, so it can be read as an overall customer-value axis.
Principal Component 2 adds another 9% of the variance, with its largest loadings on the OT_MYCOKE ordering-channel variables (positive) and the onboarding-time variables DAYS_ONBOARDING and DAYS_FIRST_DLV (negative).
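The manual eigen-decomposition above can be cross-checked with base R's prcomp(), which returns the same variance-explained shares (and the same loadings, up to sign flips). A minimal sketch on simulated data:

```r
# Sketch: variance explained via prcomp() vs. manual eigen-decomposition (toy data)
set.seed(500)
toy <- matrix(rnorm(500), ncol = 5)
toy_scaled <- scale(toy)  # center and scale, as in the analysis above

# Manual route: eigenvalues of the covariance matrix
ev <- eigen(cov(toy_scaled))$values
var_manual <- ev / sum(ev)

# Built-in route: prcomp() on the already-standardized data
pca <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)
var_prcomp <- pca$sdev^2 / sum(pca$sdev^2)

# Both routes give identical variance-explained shares
all.equal(var_manual, var_prcomp)
```

Because eigenvectors are only defined up to sign, individual loadings from the two routes may differ by a factor of -1, but the interpretation of each component is unchanged.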

7.2 Cluster Features

The clusters will be characterized based on their relationships with other variables.

Code
# Define specific colors for fleet types
fleet_colors <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")  # Custom colors for FleetType

# Create a cross-tabulation of CLUSTER and FLEET_TYPE
cluster_fleet_table <- table(full_data_customer$CLUSTER, full_data_customer$FLEET_TYPE, 
                            useNA = "ifany")

# Create data frame for visualization
cluster_fleet_df <- as.data.frame.table(cluster_fleet_table)
names(cluster_fleet_df) <- c("Segment", "FleetType", "Count")

# Filter out NA values for cleaner visualization
cluster_fleet_df <- cluster_fleet_df %>% 
  filter(!is.na(Segment) & !is.na(FleetType))

# Calculate proportions
cluster_fleet_df$Pct <- cluster_fleet_df$Count / ave(cluster_fleet_df$Count, cluster_fleet_df$Segment, FUN = sum)

# Create percentage distribution plot for fleet types within clusters
ggplot(cluster_fleet_df, aes(x = Segment, y = Pct, fill = FleetType)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fleet_colors) +  # Use custom colors for fleet types
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)), 
            position = position_fill(vjust = 0.5), 
            color = "black", size = 3.5) +  # Add percentage text labels inside the bars
  labs(title = "Fleet Type Distribution Across Clusters",
       subtitle = "Fleet type classification using a 400-gallon threshold",
       x = "Cluster",
       y = "Percentage",
       fill = "Fleet Type") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Code
# Define specific colors for the other variables
high_growth_colors <- c("1" = "#FF6347", "0" = "#D3D3D3")  # High Growth vs Low Growth
fountain_only_colors <- c("1" = "#4682B4", "0" = "#D3D3D3")  # Fountain Only vs Not Fountain Only

# Create data frame for HIGH_GROW_POT visualization
cluster_high_growth_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$HIGH_GROW_POT))
names(cluster_high_growth_df) <- c("Segment", "HighGrowth", "Count")

# Filter out NA values for cleaner visualization
cluster_high_growth_df <- cluster_high_growth_df %>% 
  filter(!is.na(Segment) & !is.na(HighGrowth))

# Calculate proportions for HIGH_GROW_POT
cluster_high_growth_df$Pct <- cluster_high_growth_df$Count / ave(cluster_high_growth_df$Count, cluster_high_growth_df$Segment, FUN = sum)

# Plot for HIGH_GROW_POT distribution by clusters
ggplot(cluster_high_growth_df, aes(x = Segment, y = Pct, fill = HighGrowth)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = high_growth_colors) +  # Custom colors for HIGH_GROW_POT
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)), 
            position = position_fill(vjust = 0.5), 
            color = "black", size = 3.5) +  # Add percentage text labels inside the bars
  labs(title = "Clusters by Growth Potential",
       subtitle = "Proportional Representation by High Growth Potential",
       x = "Cluster",
       y = "Percentage",
       fill = "Growth Potential") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Code
# Create data frame for LOCAL_FOUNT_ONLY visualization
cluster_fountain_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$LOCAL_FOUNT_ONLY))
names(cluster_fountain_df) <- c("Segment", "FountainOnly", "Count")

# Filter out NA values for cleaner visualization
cluster_fountain_df <- cluster_fountain_df %>% 
  filter(!is.na(Segment) & !is.na(FountainOnly))

# Calculate proportions for LOCAL_FOUNT_ONLY
cluster_fountain_df$Pct <- cluster_fountain_df$Count / ave(cluster_fountain_df$Count, cluster_fountain_df$Segment, FUN = sum)

# Plot for LOCAL_FOUNT_ONLY distribution by clusters
ggplot(cluster_fountain_df, aes(x = Segment, y = Pct, fill = FountainOnly)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fountain_only_colors) +  # Custom colors for LOCAL_FOUNT_ONLY
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)), 
            position = position_fill(vjust = 0.5), 
            color = "black", size = 3.5) +  # Add percentage text labels inside the bars
  labs(title = "Clusters by Fountain Only",
       subtitle = "Proportional Representation by Fountain Only",
       x = "Cluster",
       y = "Percentage",
       fill = "Fountain Only") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Code
# Define specific colors for the LOW_DEMAND_CUST variable
low_demand_colors <- c("1" = "yellow", "0" = "#D3D3D3")  # Low Demand vs Not Low Demand

# Create data frame for LOW_DEMAND_CUST visualization
cluster_low_demand_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$LOW_DEMAND_CUST))
names(cluster_low_demand_df) <- c("Segment", "LowDemand", "Count")

# Filter out NA values for cleaner visualization
cluster_low_demand_df <- cluster_low_demand_df %>% 
  filter(!is.na(Segment) & !is.na(LowDemand))

# Calculate proportions for LOW_DEMAND_CUST
cluster_low_demand_df$Pct <- cluster_low_demand_df$Count / ave(cluster_low_demand_df$Count, cluster_low_demand_df$Segment, FUN = sum)

# Plot for LOW_DEMAND_CUST distribution by clusters
ggplot(cluster_low_demand_df, aes(x = Segment, y = Pct, fill = LowDemand)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = low_demand_colors) +  # Custom colors for LOW_DEMAND_CUST
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)), 
            position = position_fill(vjust = 0.5), 
            color = "black", size = 3.5) +  # Add percentage text labels inside the bars
  labs(title = "Clusters by Low Demand Customers",
       subtitle = "Proportional Representation by Low Demand Customers",
       x = "Cluster",
       y = "Percentage",
       fill = "Low Demand Customers") +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Code
# Define colors from the custom palette for clusters
# Reshape data for faceting
plot_data <- melt(full_data_customer, id.vars = "CLUSTER", measure.vars = c("RFM_SCORE", "NUM_ORDERS", "TOTAL_ORDERED"))

# Create a new variable for log-transformed TOTAL_ORDERED
full_data_customer$LOG_TOTAL_ORDERED <- log1p(full_data_customer$TOTAL_ORDERED)  # log1p to handle zero values

# Reshape data using tidyr::pivot_longer()
plot_data <- full_data_customer %>%
  pivot_longer(cols = c(RFM_SCORE, NUM_ORDERS, LOG_TOTAL_ORDERED),
               names_to = "variable", values_to = "value")

# Rename variable levels for better readability
plot_data$variable <- case_when(
  plot_data$variable == "LOG_TOTAL_ORDERED" ~ "TOTAL_ORDERED (Log Scale)",
  TRUE ~ plot_data$variable
)

# Create a boxplot with facet_wrap
ggplot(plot_data, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +  # Add black borders for contrast
  scale_fill_manual(values = palette_clusters) +  # Apply custom colors for clusters
  facet_wrap(~ variable, scales = "free_y") +  # Allow different y-scales per variable
  labs(
    title = "Customer Segmentation Characterization", 
    subtitle = "Distribution of RFM Score, Number of Orders, and Log-Transformed Total Ordered for each cluster",
    x = "Cluster", 
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray30"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_blank(),
    legend.position = "none"  # Remove legend since clusters are already labeled on the x-axis
  )

Code
# Prepare the dataset
plot_data_filtered <- full_data_customer %>%
  pivot_longer(cols = c("OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", 
                        "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI"),
               names_to = "variable", values_to = "value") %>%
  mutate(value = log1p(value))  # Log-transform values safely to handle zero values

# Define custom labels for the variables
custom_labels <- c(
  "OT_CALL.CENTER" = "Call Center",
  "OT_OTHER" = "Other",
  "OT_SALES.REP" = "Sales Rep",
  "OT_MYCOKE.LEGACY" = "MyCoke Legacy",
  "OT_MYCOKE360" = "MyCoke360",
  "OT_EDI" = "EDI"
)

# Generate the boxplot
ggplot(plot_data_filtered, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +  # Add black borders for contrast
  scale_fill_manual(values = palette_clusters) +  # Apply custom colors for clusters
  facet_wrap(~ variable, scales = "fixed", labeller = labeller(variable = as_labeller(custom_labels))) + 
  scale_y_continuous(limits = c(0, 6), breaks = seq(0, 6, 1)) +  # Set fixed scale for y-axis
  labs(
  title = "Customer Segmentation Characterization", 
  subtitle = "Distribution of orders by Order Type (Log Scale) for each cluster",
    x = "Cluster",
    y = "Log(Value)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray30"),
    axis.title = element_text(face = "bold"),
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_blank(),
    legend.position = "none"  # Remove legend since clusters are already labeled on the x-axis
  )

Cluster 1 (Red): High Demand Customers
- Composition: Approximately 80% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Around 7% of customers exhibit high growth potential.
- Local Fountain Only: Only 1.5% of customers are local fountain-only.
- Average RFM: The average RFM score for this cluster is 29, the highest among the three clusters.
- Average Number of Orders: The average number of orders per customer was 81 in 2023 and 2024, with many outliers showing significantly higher order volumes.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 4,638 gallons. This cluster has the highest number of high-volume outliers, which pulls the average up; the median volume is 1,707 gallons.
- Volume Share: This cluster represents 76% of the total volume consumed in 2023 and 2024.
- It has the highest number of orders through digital channels and is the cluster most served by sales representatives.

Cluster 2 (Blue): Intermediate Customers with Growth Potential
- Composition: Approximately 87% of customers receive deliveries via white trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: This cluster has the highest percentage of customers with high growth potential, at 16.6%.
- Local Fountain Only: Around 4.2% of customers are local fountain-only.
- Average RFM: The average RFM score for this group is 18.7, the second highest among the clusters.
- Average Number of Orders: The average number of orders per customer was 30 in 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 525 gallons. The median volume is 331 gallons.
- Volume Share: This cluster represents approximately 22% of the total volume consumed in 2023 and 2024.
- It is the cluster with the highest average number of orders placed via the call center. It places fewer orders through digital channels than Cluster 1 but more than Cluster 3, and its number of orders through sales representatives is similar to Cluster 1.

Cluster 3 (Yellow): Less Active Customers with Low Order Volume
- Composition: Only 0.4% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Approximately 6% of customers exhibit high growth potential.
- Local Fountain Only: This cluster has the highest percentage of local fountain-only customers, at 7.5%.
- Average RFM: The average RFM score is 7, indicating these are the least active customers.
- Average Number of Orders: The average number of orders per customer was 5.5 in 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is around 80 gallons, while the median is 57 gallons, indicating a large number of customers with smaller volumes.
- Volume Share: This cluster represents only 1.7% of the total volume consumed in 2023 and 2024.
- The cluster shows orders concentrated through call centers, digital channels, and sales representatives, although in smaller absolute quantities compared to the other clusters.

8. Classification Models for Explaining Clusters

To better understand the variables influencing cluster composition and facilitate future predictions without the need for re-clustering, two classification models will be used: decision trees and multinomial logistic regression. These models will help identify the key characteristics that drive cluster formation.

By applying these models to new data, cluster assignments can be predicted, streamlining the analysis process and eliminating the need to recreate the clusters whenever new data is introduced.
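The idea of scoring new customers without re-running K-means can be sketched with a small rpart tree on simulated data (rpart ships with standard R distributions; the two predictors and their distributions here are illustrative stand-ins, not the client's data):

```r
library(rpart)  # recommended package bundled with R

# Sketch: learn the (simulated) cluster labels so new customers can be scored
# without re-running the clustering step
set.seed(500)
toy <- data.frame(
  NUM_ORDERS    = c(rpois(100, 80), rpois(100, 30), rpois(100, 5)),
  TOTAL_ORDERED = c(rnorm(100, 4600, 500), rnorm(100, 520, 80), rnorm(100, 80, 20)),
  CLUSTER       = factor(rep(1:3, each = 100))
)

# Fit a classification tree on the labeled data
tree_fit <- rpart(CLUSTER ~ ., data = toy, method = "class")

# Score a hypothetical new customer: no re-clustering required
new_customer <- data.frame(NUM_ORDERS = 75, TOTAL_ORDERED = 4200)
predict(tree_fit, new_customer, type = "class")
```

The same predict() call works on an entire data frame of new customers, which is what makes this approach practical for ongoing fleet-allocation decisions.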

8.1 Decision Tree

The selected variables will be analyzed to explain the clusters using a decision tree, with the dataset split 70/30 into training and test sets and 20-fold cross-validation applied to the training set.

Code
# Prepare data for decision tree
model_data <- full_data_customer %>%
  dplyr::select(all_of(selected_vars), CLUSTER) %>%
  mutate(CLUSTER = as.factor(CLUSTER)) 

# Create train/test split (70% train, 30% test)
set.seed(500)  
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]

# Set up cross-validation (20-fold)
train_control <- trainControl(method = "cv", number = 20)

# Train the decision tree model with cross-validation
decision_tree_model <- train(
  CLUSTER ~ ., 
  data = train_data,  
  method = "rpart", 
  trControl = train_control,
  tuneLength = 5)  

# Plot the decision tree
rpart.plot(
  decision_tree_model$finalModel, 
  extra = 101,                
  box.palette = "Blues",      
  shadow.col = "gray",        
  nn = TRUE,                  
  main = "Decision Tree: Explaining Customer Clusters", 
  branch.col = "gray",        
  faclen = 0,
  tweak = 1.1)                  

Below are the prediction performance metrics:

Code
# Evaluate model performance on test set
dt_test_predictions <- predict(decision_tree_model, test_data, type = "raw")
dt_test_confusion_matrix <- confusionMatrix(dt_test_predictions, test_data$CLUSTER)

# Calculate accuracy on the test set
dt_test_accuracy <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Evaluate model performance on train set
dt_train_predictions <- predict(decision_tree_model, train_data, type = "raw")
dt_train_accuracy <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)

# Print model performance metrics
cat("\n--- Decision Tree Model Performance ---\n")

--- Decision Tree Model Performance ---
Code
print(dt_test_confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction    1    2    3
         1 1568  170    0
         2  332 4336   72
         3    1  270 2346

Overall Statistics
                                         
               Accuracy : 0.9071         
                 95% CI : (0.9009, 0.913)
    No Information Rate : 0.5251         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.8477         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            0.8248   0.9079   0.9702
Specificity            0.9764   0.9065   0.9594
Pos Pred Value         0.9022   0.9148   0.8964
Neg Pred Value         0.9547   0.8990   0.9889
Prevalence             0.2090   0.5251   0.2659
Detection Rate         0.1724   0.4767   0.2579
Detection Prevalence   0.1911   0.5212   0.2877
Balanced Accuracy      0.9006   0.9072   0.9648
Code
# Evaluate accuracy on train and test sets for decision tree
dt_train_acc <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)
dt_test_acc <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Create comparison dataframe
dt_acc_comp <- data.frame(
  Set = c("Train", "Test"),
  Accuracy = c(dt_train_acc, dt_test_acc))

# Display the formatted table with kable
dt_acc_comp %>%
  kable(caption = "Decision Tree Accuracy Comparison (Train vs Test)", 
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Decision Tree Accuracy Comparison (Train vs Test)
Dataset Accuracy
Train 0.91
Test 0.91

The model has an accuracy of 91% on both the train and test sets, demonstrating strong performance across all clusters. In Cluster 1: High Demand Customers, precision is 90.2% and recall is 82.5%. For Cluster 2: Intermediate Customers with Growth Potential, precision is 91.5% and recall is 90.8%. For Cluster 3: Less Active Customers with Low Order Volume, precision is 89.6% and recall is 97.0%.

Overall, the model performs well across all clusters. Cluster 3 combines the highest recall (97.0%) with solid precision, Cluster 2 shows balanced precision and recall, and Cluster 1's recall of 82.5% is the weakest figure, with most of its misclassified customers assigned to Cluster 2. The identical 91% accuracy on the train and test sets indicates good generalization.
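The per-class figures quoted above follow directly from the printed confusion matrix; recomputing them in base R makes the definitions explicit (rows are predictions, columns are the reference labels):

```r
# Recompute per-class precision and recall from the test-set confusion matrix above
cm <- matrix(c(1568,  170,    0,
                332, 4336,   72,
                  1,  270, 2346),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = 1:3, Reference = 1:3))

recall    <- diag(cm) / colSums(cm)  # sensitivity: correct / all true members of a class
precision <- diag(cm) / rowSums(cm)  # pos. pred. value: correct / all predicted as a class
accuracy  <- sum(diag(cm)) / sum(cm)

round(rbind(Precision = precision, Recall = recall), 4)
round(accuracy, 4)  # 0.9071, matching the confusionMatrix() output
```

This also shows where the errors concentrate: 332 true Cluster 1 customers predicted as Cluster 2, which is what depresses Cluster 1's recall.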

Code
# Extract and display variable importance from the trained decision tree model
var_importance <- decision_tree_model$finalModel$variable.importance
dt_var_importance_df <- data.frame(
  Variable = names(var_importance),
  Importance = var_importance
)

# Normalize importance values
dt_var_importance_df <- dt_var_importance_df %>%
  mutate(Importance = Importance / max(Importance))

# Sort by importance and visualize top 10 variables
dt_var_importance_df <- dt_var_importance_df %>% 
  arrange(desc(Importance)) %>%
  head(10)

# Plot the top 10 most important variables
ggplot(dt_var_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "seagreen") +
  coord_flip() +
  labs(
    title = "Top 10 Variables Explaining Customer Clusters",
    subtitle = "Decision Tree Variable Importance",
    x = NULL,
    y = "Relative Importance (Normalized)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    axis.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

The most important variables in the model were the number of orders per customer, average days between orders, RFM score, total ordered volume, total cost, and the low demand customers flag.

8.2 Multinomial Logistic Regression

The influence of the selected variables on the customer clusters will be explored using multinomial logistic regression, which models the probability that a customer belongs to each of the three established clusters. This makes it possible to estimate, for any new customer, the likelihood of membership in each segment based on their characteristics.

Variable standardization and Elastic Net regularization will be used in the model development process.
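For reference, glmnet fits the multinomial model by minimizing the penalized negative log-likelihood below; the tuning grid in the code fixes the mixing parameter at alpha = 0.5 (an equal blend of ridge and lasso) and searches only over the penalty strength lambda:

```latex
\min_{\beta_0,\, \beta} \; -\frac{1}{N}\, \ell(\beta_0, \beta)
  \;+\; \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2
  + \alpha \lVert \beta \rVert_1 \right]
```

The lasso term drives uninformative coefficients to exactly zero, while the ridge term stabilizes correlated predictors. At lambda = 0.55 and above the penalty is strong enough that the model effectively collapses to the majority class, which is consistent with the accuracy falling to the no-information rate of roughly 0.525 (with Kappa = 0) in the resampling results.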

Code
# Normalize predictors
preprocess_params <- preProcess(model_data, method = c("center", "scale"))
model_data <- predict(preprocess_params, model_data)

# Create train/test split (70% train, 30% test)
set.seed(500)  
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]

# Set up cross-validation 
train_control <- trainControl(method = "cv", number = 10)

# Define a smaller tuning grid for efficiency
tune_grid <- expand.grid(alpha = 0.5, 
                         lambda = seq(0.1, 1, length = 5))

# Train model with Elastic Net regularization
mlogistic_model <- train(
  CLUSTER ~ ., 
  data = train_data,  
  method = "glmnet",
  trControl = train_control,
  tuneGrid = tune_grid,
  maxit = 200000)  # passed through to glmnet to raise the iteration limit

# Print trained model summary
print(mlogistic_model)
glmnet 

21225 samples
   21 predictor
    3 classes: '1', '2', '3' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 19102, 19103, 19102, 19103, 19103, 19102, ... 
Resampling results across tuning parameters:

  lambda  Accuracy   Kappa    
  0.100   0.8948407  0.8217432
  0.325   0.7363958  0.5030291
  0.550   0.5250412  0.0000000
  0.775   0.5250412  0.0000000
  1.000   0.5250412  0.0000000

Tuning parameter 'alpha' was held constant at a value of 0.5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.5 and lambda = 0.1.
Code
# Make predictions on the test set
mlogis_predictions <- predict(mlogistic_model, test_data)

# Evaluate model performance
mlogistic_model_performance <- postResample(pred = mlogis_predictions, obs = test_data$CLUSTER)
print(mlogistic_model_performance)
 Accuracy     Kappa 
0.8931281 0.8189136 
Code
# Display confusion matrix
mlogistic_confusion_matrix <- confusionMatrix(mlogis_predictions, test_data$CLUSTER)
print(mlogistic_confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction    1    2    3
         1 1346    2    0
         2  555 4575  216
         3    0  199 2202

Overall Statistics
                                          
               Accuracy : 0.8931          
                 95% CI : (0.8866, 0.8994)
    No Information Rate : 0.5251          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8189          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity            0.7080   0.9579   0.9107
Specificity            0.9997   0.8215   0.9702
Pos Pred Value         0.9985   0.8558   0.9171
Neg Pred Value         0.9284   0.9464   0.9677
Prevalence             0.2090   0.5251   0.2659
Detection Rate         0.1480   0.5030   0.2421
Detection Prevalence   0.1482   0.5878   0.2640
Balanced Accuracy      0.8539   0.8897   0.9404
Code
# Generate predictions on train set
train_predictions <- predict(mlogistic_model, train_data)

# Evaluate accuracy on train and test sets
mlogistic_train_acc <- round(postResample(pred = train_predictions, obs = train_data$CLUSTER)["Accuracy"], 2)
mlogistic_test_acc <- round(postResample(pred = mlogis_predictions, obs = test_data$CLUSTER)["Accuracy"], 2)

# Create comparison dataframe
mlogistic_acc_comp <- data.frame(
  Set = c("Train", "Test"),
  Accuracy = c(mlogistic_train_acc, mlogistic_test_acc)
)

# Display the formatted table
mlogistic_acc_comp %>%
  kable(caption = "Multinomial Logistic Regression Accuracy Comparison (Train vs Test)", 
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Multinomial Logistic Regression Accuracy Comparison (Train vs Test)
Dataset Accuracy
Train 0.89
Test 0.89

The model achieved an accuracy of 89.3% on the test set, reflecting strong performance. For Cluster 1 (Red): High Demand Customers, recall is 70.8% and precision is 99.8%. For Cluster 2 (Blue): Intermediate Customers with Growth Potential, recall is 95.8% and precision is 85.6%. For Cluster 3 (Yellow): Less Active Customers with Low Order Volume, recall is 91.1% and precision is 91.7%. Overall, the model performs well, with Cluster 2 showing the highest recall and Cluster 1 the strongest precision.

The relatively low recall in Cluster 1 (Red) (70.8%) suggests that the model may not always correctly identify customers in this group, leading to false negatives.
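These per-class figures follow directly from the confusion matrix above: recall for a class is its diagonal count divided by its column (Reference) total, and precision is the diagonal count divided by its row (Prediction) total. A quick sketch, with the counts copied from the matrix:

```r
# Recompute class-level recall and precision from the confusion matrix
# (rows = predictions, columns = reference).
cm <- matrix(c(1346,    2,    0,
                555, 4575,  216,
                  0,  199, 2202),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = 1:3, Reference = 1:3))

recall    <- diag(cm) / colSums(cm)  # sensitivity per class
precision <- diag(cm) / rowSums(cm)  # positive predictive value per class
round(recall, 3)     # 0.708 0.958 0.911
round(precision, 3)  # 0.999 0.856 0.917
```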

Code
# Extract variable importance from the multinomial model
variable_importance <- varImp(mlogistic_model, scale = TRUE)

# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347",  # Coral
  "2" = "#4682B4",  # Cornflower blue
  "3" = "#FFD700")  # Yellow

# Extract importance data
var_imp_df <- as.data.frame(variable_importance$importance)
var_imp_df$Variable <- rownames(var_imp_df)

# Convert to long format
var_imp_long <- melt(var_imp_df, 
                     id.vars = "Variable", 
                     variable.name = "Cluster", 
                     value.name = "Importance")

# Clean up cluster names (remove 'Overall' if present)
var_imp_long$Cluster <- gsub("Overall", "", var_imp_long$Cluster)

# Keep only the top 10 variables per cluster for better visualization
top_vars <- var_imp_long %>%
  group_by(Cluster) %>%
  top_n(10, Importance) %>%
  ungroup()

# Create visualization with custom cluster colors and no color legend
ggplot(top_vars, aes(x = reorder(Variable, Importance), y = Importance, fill = Cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ Cluster, scales = "free_x") +
  scale_fill_manual(values = palette_clusters) +  # Apply custom colors
  # Set the y-axis (importance) to have the same scale 0-100 for all facets
  scale_y_continuous(limits = c(0, 100)) +
  labs(
    title = "Multinomial Logistic Regression",
    subtitle = "Variable Importance by Cluster",
    x = "Variables",
    y = "Importance"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    strip.text = element_text(size = 12, face = "bold"),
    axis.title = element_text(face = "bold"),
    axis.text.y = element_text(size = 9),
    panel.grid.major.y = element_blank()
  ) +
  guides(fill = "none")  # Remove the color legend

The model indicates that:

For Cluster 1, the key variables included the number of orders, RFM score, order type (MyCoke Legacy), order type (MyCoke 360), order type (using sales representatives), and chain member.

For Cluster 2, the most significant variables were the average number of days between orders, low demand customers, order type (call center), order type (MyCoke Legacy), and channel growth potential.

For Cluster 3, the most important variables were the average number of days between orders, low demand customers, RFM score, and days since the first delivery.

The models built to predict cluster membership for new customers performed well and clarify which characteristics shape each cluster. We can therefore proceed to the final fleet assignment analysis.

9. Data-Driven Fleet Assignment

Based on all the previous analyses, it is concluded that the fleet type designated for clients should be defined by considering different criteria, not just the average annual volume demand.

The main criteria shaping this approach include the similarities among clients represented by the clusters, the analysis of volume distributions by cold drink channel segment, and the growth potential of the clients.

Before proceeding, the relationship between each cluster and the 400-gallon annual threshold will be analyzed.

Code
# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347",  # Coral
  "2" = "#4682B4",  # Cornflower blue
  "3" = "#FFD700"   # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create scatter plot with log scales, separated by CLUSTER using facet_wrap
ggplot() +
  geom_jitter(data = full_data_customer, 
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER)),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits (1 -> 500)
  scale_y_log10(
    limits = c(10, 1000000),  # Set limits for the y-axis from 10 to 1,000,000
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),  # Custom breaks for the Y-axis
    labels = scales::comma  # Format numbers with commas
  ) + 
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") +  # Add the threshold line to legend
  labs(
    title = "Average Annual Consumption vs. Number of Orders by Cluster",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER, scales = "fixed") +  # Ensure same scale across all facets
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),  # Adjust facet labels' size
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),  # Major Y grid lines for the specific breaks
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),  # Light gray vertical grid lines as background
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    panel.background = element_rect(fill = "white", color = "white"),  # Ensure clean background
    legend.position = "right"  # Move legend to right side
  )

Regarding the 400-gallon benchmark for assigning clients to red trucks, the following can be noted:

  • Cluster 1: This cluster mainly contains clients with higher demand volumes or a larger number of orders. A smaller portion of these clients falls below the 400-gallon threshold, with some close to a 100-gallon minimum.

  • Cluster 2: This cluster has a large number of clients both above and below the threshold, so it requires further refinement.

  • Cluster 3: The vast majority of clients fall below the threshold, and the few above it tend to place only a small number of orders per year.

9.1 Cluster 2 Analysis for Fleet Assignment

Cluster 2 comprises just over half of all clients, making it difficult to define clear criteria for fleet designation.

The multinomial regression model indicated that the variable “Average Days Between Orders” (AVG_DAYS_BET_ORD) was the most important, while in the decision tree model, it was the second most important variable.

Therefore, below is the plot showing the relationship between the average annual consumption of each client and their average days between orders.

Code
# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347",  # Coral
  "2" = "#4682B4",  # Cornflower blue
  "3" = "#FFD700"   # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create scatter plot with log scale for x-axis, separated by CLUSTER using facet_wrap
ggplot() +
  geom_jitter(data = full_data_customer, 
              aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER)),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(
    limits = c(1, 1000),  # Set limits for the x-axis
    breaks = c(10, 100, 1000),  # Custom breaks for the X-axis
    labels = scales::comma  # Format numbers with commas
  ) +
  scale_y_log10(
    limits = c(10, 1000000),  # Set limits for the y-axis from 10 to 1,000,000
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),  # Custom breaks for the Y-axis
    labels = scales::comma  # Format numbers with commas
  ) + 
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") +  # Add the threshold line to legend
  labs(
    title = "Avg. Annual Consumption vs. Avg. Days Between Orders by Cluster",
    x = "Average Days Between Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER, scales = "fixed") +  # Ensure same scale across all facets
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),  # Adjust facet labels' size
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),  # Major Y grid lines for the specific breaks
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),  # Light gray vertical grid lines as background
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    panel.background = element_rect(fill = "white", color = "white"),  # Ensure clean background
    legend.position = "right"  # Move legend to right side
  )

Code
# Filter the data for Cluster 2
cluster_2_data <- full_data_customer %>%
  filter(CLUSTER == 2)

# Calculate deciles for AVG_DAYS_BET_ORD
deciles <- quantile(cluster_2_data$AVG_DAYS_BET_ORD, probs = seq(0, 1, 0.1))

# Create a simple data frame with the decile values, transposing it for horizontal display
decile_table <- data.frame(
  Decile = paste0(seq(0, 90, 10), "%"),
  Lower_Bound = deciles[-length(deciles)],  # All but the last quantile value
  Upper_Bound = deciles[-1]  # All but the first quantile value
)

# Print the decile table horizontally
decile_table_t <- t(decile_table[,-1])
colnames(decile_table_t) <- decile_table$Decile
decile_table_t
            0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
Lower_Bound  3  11  14  17  20  24  28  33  40  52
Upper_Bound 11  14  17  20  24  28  33  40  52 731
Code
# Get a summary of AVG_DAYS_BET_ORD for Cluster 2
summary_cluster_2 <- summary(cluster_2_data$AVG_DAYS_BET_ORD)

# Round the summary 
summary_cluster_2_rounded <- round(summary_cluster_2, 0)

# Display the summary in a simple format
summary_table <- data.frame(
  Statistic = names(summary_cluster_2_rounded),
  Value = as.vector(summary_cluster_2_rounded)
)

# Print the summary table
summary_table
  Statistic Value
1      Min.     3
2   1st Qu.    16
3    Median    24
4      Mean    56
5   3rd Qu.    37
6      Max.   731

Filtering the average days between orders for Cluster 2 shows that roughly 70 percent of customers average 33 days or fewer between orders. The group mean is about 56 days, with a median of 24 days.
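A share like this can also be read directly rather than from the decile table: the proportion of customers at or below a cutoff is the empirical CDF evaluated at that cutoff. A minimal sketch with illustrative values (on the real data this would be `mean(cluster_2_data$AVG_DAYS_BET_ORD <= 33)`):

```r
# The share of values at or below a cutoff is just a proportion,
# equivalently the empirical CDF evaluated at that cutoff.
days <- c(3, 11, 14, 17, 20, 24, 28, 33, 40, 52)  # illustrative order gaps
share_at_or_below <- mean(days <= 33)  # 0.8
cdf_at_cutoff <- ecdf(days)(33)        # 0.8, same quantity
```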

Building on the previously derived indicators (low demand customers and high growth potential customers), additional business-relevant criteria will be introduced to better segment customers within Cluster 2.

These new criteria are an average annual consumption greater than 1,349 gallons and an average of 33 or fewer days between orders. The first threshold was chosen because it represents the point at which delivery costs are minimized. The second was selected because of its strong influence on the clustering model, and because customers with high growth potential (excluding low demand customers) who already order every 33 days or fewer, roughly two-thirds of the cluster, are believed to have the potential to order even more frequently, further reducing the order interval.

As a result, in the plot below, customers who are not low demand, show high growth potential, and average 33 or fewer days between orders, or whose average annual consumption exceeds 1,349 gallons, will be classified as Emerging Opportunities and assigned to the red truck category.

Code
# Define custom colors
palette_clusters <- c(
  "Emerging Opportunities - RED TRUCK" = "#B33951",   # Emerging Opportunities
  "General Clients - WHITE TRUCK" = "#D3D3D3"  # General Clients
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter only Cluster 2 and create category for faceting
filtered_data <- full_data_customer %>%
  filter(CLUSTER == 2) %>%
  mutate(
    Category = ifelse((LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) | AVG_ANNUAL_CONSUMP > 1349, 
                      "Emerging Opportunities - RED TRUCK", 
                      "General Clients - WHITE TRUCK")
  )

# Create scatter plot with facet_wrap
ggplot(filtered_data) +
  geom_jitter(aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP, color = Category), 
              width = 0.2, alpha = 0.5) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 1000)) +  # Log scale for x-axis with specific limits (1 -> 1000)
  scale_y_log10(
    limits = c(10, 1000000),  # Set limits for the y-axis from 10 to 1,000,000
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),  # Custom breaks for the Y-axis
    labels = scales::comma  # Format numbers with commas
  ) + 
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +
  scale_linetype_manual(values = "solid", name = "") +  # Add the threshold line to legend
  labs(
    title = "Cluster 2 - Avg. Annual Consumption vs. Avg. Days Between Orders",
    x = "Average Days Between Orders (Log Scale)",
    y = "Avg Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ Category, scales = "fixed") +  # Separate categories side by side
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    panel.background = element_rect(fill = "white", color = "white"),  
    legend.position = "right",  # Move legend to right side
    strip.text = element_blank()  # Remove facet titles
  )

Code
# Add CLUSTER_2_FLEET variable to full_data_customer
full_data_customer <- full_data_customer %>%
  mutate(
    CLUSTER_2_FLEET = ifelse(CLUSTER == 2,
                             ifelse((LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) | AVG_ANNUAL_CONSUMP > 1349, 
                                    "RED TRUCK", 
                                    "WHITE TRUCK"),
                             NA)
  )

Below, the impact of fleet assignment on each cold drink channel will be explored using the previous criteria for Cluster 2, and its relation to average annual consumption and the number of orders will be analyzed.

Code
# Define color palette
palette_clusters <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter "RED TRUCK" data from full_data_customer
cluster_2_red_truck_data <- subset(full_data_customer, !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET == "RED TRUCK")

# Create a data frame for the threshold line (fixed at 400 gallons)
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create the plot
ggplot() +
  geom_jitter(data = cluster_2_red_truck_data, 
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER_2_FLEET)),
              alpha = 0.5, width = 0.2) +  
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  
  scale_y_log10(
    limits = c(10, 100000),  
    breaks = c(10, 100, 1000, 10000, 100000),  
    labels = comma  
  ) + 
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +  
  scale_linetype_manual(values = "solid", name = NULL) +  
  labs(
    title = "Average Annual Consumption vs. Number of Orders for Cluster 2",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +  
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),  
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.minor = element_blank(),  
    panel.background = element_rect(fill = "white", color = "white"),  
    legend.position = "right"  
  )

In line with the emphasis on growth opportunities, almost every channel segment would include a considerable number of clients below the 400-gallon threshold who are nonetheless served by red trucks.

Code
# Define color palette
palette_clusters <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter "WHITE TRUCK" data from full_data_customer
cluster_2_white_truck_data <- subset(full_data_customer, !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET == "WHITE TRUCK")

# Create a data frame for the threshold line (fixed at 400 gallons)
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create the plot
ggplot() +
  geom_jitter(data = cluster_2_white_truck_data, 
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER_2_FLEET)),
              alpha = 0.5, width = 0.2) +  
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  
  scale_y_log10(
    limits = c(10, 100000),  
    breaks = c(10, 100, 1000, 10000, 100000),  
    labels = comma  
  ) + 
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +  
  scale_linetype_manual(values = "solid", name = NULL) +  
  labs(
    title = "Average Annual Consumption vs. Number of Orders for Cluster 2",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +  
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),  
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),  
    panel.grid.minor = element_blank(),  
    panel.background = element_rect(fill = "white", color = "white"),  
    legend.position = "right"  
  )

On the other hand, the criteria naturally assign white trucks to a large number of clients with an average annual volume of less than 400 gallons in each segment, while still ensuring that high-volume clients are served by red trucks.

The graphs above also represent an opportunity for the company to develop targeted strategies for each segment.

9.2 Fleet Assignment Criteria

Based on the analysis, the recommended fleet assignment will be determined by the following criteria:

  1. Customers with an average annual consumption greater than 1349 gallons will be assigned to RED TRUCKS.

  2. Low-demand customers (identified by LOW_DEMAND_CUST == 1) will be assigned to WHITE TRUCKS.

  3. All remaining customers in Cluster 1 will be assigned to RED TRUCKS, after applying the previous rules.

  4. All customers in Cluster 3 will be assigned to WHITE TRUCKS, after applying the previous rules.

  5. Remaining customers in Cluster 2 will be assigned to RED TRUCKS if they meet both of the following conditions:

    • They are classified as high growth potential (HIGH_GROW_POT == 1).
    • Their average days between orders is less than or equal to 33 (AVG_DAYS_BET_ORD <= 33).
  6. The remaining customers in Cluster 2 will be assigned to WHITE TRUCKS.

  7. Any customers who do not meet any of these criteria will remain unclassified (NA).

Code
# Assign customers to RED TRUCK or WHITE TRUCK based on specified criteria
full_data_customer <- full_data_customer %>%
  mutate(
    NEW_FLEET = case_when(
      AVG_ANNUAL_CONSUMP > 1349 ~ "RED TRUCK",           # Customers with high annual consumption
      LOW_DEMAND_CUST == 1 ~ "WHITE TRUCK",             # Low-demand customers
      CLUSTER == 1 ~ "RED TRUCK",                       # Cluster 1 customers
      CLUSTER == 3 ~ "WHITE TRUCK",                     # Cluster 3 customers
      CLUSTER == 2 & (LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) ~ "RED TRUCK",  # Cluster 2 customers with all conditions met
      CLUSTER == 2 ~ "WHITE TRUCK",                     # Remaining Cluster 2 customers
      TRUE ~ NA_character_                              # Others remain NA
    )
  )

Below are the representations of the clusters and the designated fleet.

Code
# Define custom colors for the fleet and clusters
palette_fleet <- c(
  "RED TRUCK" = "#B33951",  # Red truck
  "WHITE TRUCK" = "#D3D3D3"  # White truck
)

palette_clusters <- c(
  "Cluster 1" = "#FF6347",  # Coral
  "Cluster 2" = "#4682B4",  # Cornflower blue
  "Cluster 3" = "#FFD700"   # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter the data for RED TRUCK and WHITE TRUCK
red_truck_data <- full_data_customer %>% filter(NEW_FLEET == "RED TRUCK")
white_truck_data <- full_data_customer %>% filter(NEW_FLEET == "WHITE TRUCK")

# Combine both datasets to differentiate them in facet_wrap
combined_data <- bind_rows(
  red_truck_data %>% mutate(Fleet_Type = "RED TRUCK"),
  white_truck_data %>% mutate(Fleet_Type = "WHITE TRUCK")
)

# Define a custom labeller for the clusters
custom_labeller <- labeller(
  CLUSTER = c(
    "1" = "Cluster 1",
    "2" = "Cluster 2",
    "3" = "Cluster 3"
  )
)

# Create scatter plot with log scales and no background color for facet labels
ggplot(combined_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits
  scale_y_log10(
    limits = c(10, 1000000),
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),
    labels = scales::comma
  ) + 
  scale_color_manual(values = palette_fleet) +
  scale_linetype_manual(values = "solid", name = "") + 
  labs(
    title = "Fleet Assignment by Cluster",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER + Fleet_Type, scales = "fixed", nrow = 1, labeller = custom_labeller) +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),  # Adjust text size for facet labels
    strip.background = element_blank(),  # Remove background color from facet labels
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom",
    legend.box = "vertical"
  ) +
  guides(color = "none")  # Remove legend for Fleet_Type

Code
# Summary
#summary(as.factor(full_data_customer$NEW_FLEET))

#summary(as.factor(full_data_customer$FLEET_TYPE))
#  RED TRUCK WHITE TRUCK 
#       7239       23081 

The new criteria assign a label to every customer: 7,926 customers were assigned to “Red Truck” and 22,394 to “White Truck”.

The 400-gallon annual average consumption criterion alone would have assigned 7,239 customers to “Red Truck” and 23,081 to “White Truck”.

Therefore, 687 clients who were previously served by white trucks and who present higher growth potential will now be served by red trucks.
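One way to see exactly which customers moved between fleets is to cross-tabulate the old threshold-based label against the recommended label. A minimal sketch on synthetic rows (on the real data the columns would be `FLEET_TYPE` and `NEW_FLEET` in `full_data_customer`):

```r
# Cross-tabulate old vs. new fleet labels to count reassignments
# (synthetic four-customer example for illustration).
df <- data.frame(
  FLEET_TYPE = c("RED TRUCK", "WHITE TRUCK", "WHITE TRUCK", "WHITE TRUCK"),
  NEW_FLEET  = c("RED TRUCK", "RED TRUCK",   "WHITE TRUCK", "WHITE TRUCK")
)
table(Old = df$FLEET_TYPE, New = df$NEW_FLEET)

# Customers moving from white to red trucks
moved_to_red <- sum(df$FLEET_TYPE == "WHITE TRUCK" & df$NEW_FLEET == "RED TRUCK")
moved_to_red  # 1
```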

Code
# Create a combined summary for both fleet types with percentages calculated separately by fleet_designation
summary_fleet_comparison_percent <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation, 
                             levels = c("FLEET_TYPE", "NEW_FLEET"),
                             labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group by fleet_designation and fleet_value, then calculate count
  group_by(fleet_designation, fleet_value) %>%
  summarise(count = n(), .groups = 'drop') %>%
  # Calculate the percentage for each fleet_designation
  group_by(fleet_designation) %>%
  mutate(percentage = (count / sum(count)) * 100) %>%
  ungroup()

# Plot with facet_wrap and custom background showing percentage values
ggplot(summary_fleet_comparison_percent, 
       aes(x = fleet_value, y = percentage, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +  # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(percentage, accuracy = 0.1, suffix = "%")), 
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") +  # Fixed scale for both facets
  labs(title = "Comparison of Customer Distribution by Fleet Type Designation",
       x = "Fleet Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", 
                               "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_blank(),
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )

According to the criteria, 26% of customers would be served by red trucks and 74% by white trucks.

Code
# Creating a combined summary for both fleet types
summary_fleet_comparison_absolute <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation, 
                             levels = c("FLEET_TYPE", "NEW_FLEET"),
                             labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group and calculate total counts (absolute numbers)
  group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%
  summarise(count = n()) %>%
  ungroup()

# Plot with facet_wrap and custom background showing total customer counts
ggplot(summary_fleet_comparison_absolute, 
       aes(x = LOCAL_FOUNT_ONLY, y = count, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +  # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(count)), 
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") +  # Fixed scale for both facets
  labs(title = "Comparison of Number of Customers by Fleet Type Designation",
       x = "Customer Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", 
                               "WHITE TRUCK" = "#D3D3D3")) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, color = "black"),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )

Considering only the “Others” group (customers who order from multiple sources), our recommendation would result in 729 additional stores being served by red trucks compared to the 400-gallon threshold, an increase of 10.3%.

In contrast, within the “Local Fountain Only” group, the number of customers served by red trucks would decrease by 42, representing a 23.2% reduction.
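The customer-count deltas above can be reproduced by comparing red truck counts under each designation. Below is a minimal sketch on a toy stand-in for `full_data_customer` (the real table has one row per customer, with `FLEET_TYPE` holding the 400-gallon-threshold assignment and `NEW_FLEET` the recommendation):

```r
library(dplyr)

# Toy stand-in for full_data_customer (illustrative values only)
toy <- data.frame(
  LOCAL_FOUNT_ONLY = c(0, 0, 0, 1, 1),
  FLEET_TYPE       = c("RED TRUCK", "WHITE TRUCK", "WHITE TRUCK", "RED TRUCK", "RED TRUCK"),
  NEW_FLEET        = c("RED TRUCK", "RED TRUCK",   "WHITE TRUCK", "WHITE TRUCK", "RED TRUCK")
)

# Count red-truck customers under each designation and take the difference
red_truck_delta <- toy %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    red_threshold   = sum(FLEET_TYPE == "RED TRUCK"),
    red_recommended = sum(NEW_FLEET == "RED TRUCK"),
    delta           = red_recommended - red_threshold,
    pct_change      = delta / red_threshold * 100,
    .groups = "drop"
  )
red_truck_delta
```

Run on the full customer table, `delta` yields the +729 ("Others") and -42 ("Local Fountain Only") figures reported above.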

Code
# Creating the summary data with both fleet designations
summary_volume_comparison <- full_data_customer %>%
  # Calculate total volume per customer
  mutate(total_volume = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023 + 
                         QTD_DLV_CA_2024 + QTD_DLV_GAL_2024) %>%
  # Create a longer format dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum volumes within each group
  summarise(group_volume = sum(total_volume, na.rm = TRUE)) %>%
  ungroup() %>%
  # Calculate total volume for percentage
  group_by(fleet_designation) %>%
  mutate(total_designation_volume = sum(group_volume),
         percentage = group_volume / total_designation_volume * 100) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation, 
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  )

# Create the faceted chart
ggplot(summary_volume_comparison, 
       aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +  # Light yellow background for Over 400 gallons threshold
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(percentage, accuracy = 0.1, suffix = "%")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "fixed") +  # Fixed scale for both facets
  labs(title = "Comparison of Volume Distribution by Fleet Type Designation",
       y = "Percentage of Total Volume") +
  # Set colors - assuming similar colors for both designations
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  # Set x-axis labels 
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 0),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold")
  )

Although the number of customers served by red trucks would increase, the overall volume transported remains relatively stable.

Within the “Others” customer group, there would be a reduction of approximately 1,038,637 gallons over two years, representing a 3.4% decrease. This volume would now be delivered by white trucks.

For the “Local Fountain Only” group, the reduction in volume transported by red trucks is around 104,895 gallons over two years, a 31% decrease.

Despite the increase in the number of customers served by red trucks, which may lead to higher travel times and costs, the recommendation optimizes the delivery system by allowing red trucks to focus on strategic customers while reducing overall costs through higher-volume deliveries using white trucks.

A geographic distribution analysis of the customer base can be carried out at a later stage. One opportunity that emerges from this recommendation is to encourage customers within the same ZIP code to coordinate delivery dates. This would help consolidate volumes, streamline the delivery process, and further reduce operational costs.
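The ZIP code consolidation idea can be screened for with a simple count of red truck customers per ZIP. The sketch below assumes a hypothetical `ZIP_CODE` column on the customer table (not present in the columns shown here) and uses toy data; ZIPs with two or more red truck customers would be candidates for coordinated delivery dates:

```r
library(dplyr)

# Toy customer table; ZIP_CODE is a hypothetical column for illustration
customers <- data.frame(
  CUSTOMER_NUMBER = 1:6,
  ZIP_CODE  = c("84101", "84101", "84101", "84102", "84102", "84103"),
  NEW_FLEET = c("RED TRUCK", "RED TRUCK", "WHITE TRUCK",
                "RED TRUCK", "RED TRUCK", "RED TRUCK")
)

# ZIPs with two or more red-truck customers could coordinate delivery dates
consolidation_candidates <- customers %>%
  filter(NEW_FLEET == "RED TRUCK") %>%
  count(ZIP_CODE, name = "red_truck_customers") %>%
  filter(red_truck_customers >= 2)
consolidation_candidates
```

A fuller version would also compare existing delivery dates within each ZIP to quantify how many stops could actually be merged.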

Below is the average number of days between orders for each group.

Code
# Creating a combined summary for both fleet types with mean AVG_DAYS_BET_ORD
summary_fleet_comparison_absolute <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation, 
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group by relevant factors and calculate the mean of AVG_DAYS_BET_ORD (average days between orders)
  group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%
  summarise(mean_days_bet_ord = mean(AVG_DAYS_BET_ORD, na.rm = TRUE)) %>%
  ungroup()

# Plot with facet_wrap and custom background showing mean AVG_DAYS_BET_ORD
ggplot(summary_fleet_comparison_absolute, 
       aes(x = LOCAL_FOUNT_ONLY, y = mean_days_bet_ord, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +  # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::number(mean_days_bet_ord, accuracy = 0.1)), 
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") +  # Fixed scale for both facets
  labs(title = "Comparison of Mean Days Between Orders by Fleet Type Designation",
       x = "Customer Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, color = "black"),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )

The red trucks should be optimized to serve the “Others” group, which has an average order interval of 14 days, compared to 23 days under the 400-gallon threshold model. The difference for the “Local Fountain Only” group in relation to the white trucks would be approximately 2 days.

The white trucks, on the other hand, would serve more sporadic customers, with an average interval of over 260 days between orders.

10. Recommendation Impacts

10.1 Impact on Costs

The cost impact of using red trucks is significantly higher than that of white trucks. Considering only variable costs, the OPEX delivery cost for red trucks is approximately 700% higher than for white trucks.

The calculated cost of the total volume delivered to each customer via red trucks is stored in the column TOTAL_COST_CA_GAL. To keep the estimate conservative, a 400% difference is assumed instead: the red truck cost is divided by 5 to estimate the white truck cost, stored in the column ARTM_TOTAL_COST.

Below is the cost comparison.

Code
# Estimate the white truck (ARTM) cost as one fifth of the red truck cost
full_data_customer <- full_data_customer %>%
  mutate(ARTM_TOTAL_COST = TOTAL_COST_CA_GAL / 5)

# Creating the summary data with both fleet designations for delivery cost analysis
summary_delivery_cost_comparison <- full_data_customer %>%
  # Create a column that has the appropriate cost based on fleet type
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  # Reshape the data into a longer format with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Update the delivery cost calculation for NEW_FLEET designation
  mutate(delivery_cost = case_when(
    fleet_designation == "NEW_FLEET" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "NEW_FLEET" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation == "FLEET_TYPE" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "FLEET_TYPE" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum delivery costs in each group
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation, 
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  )

# Create the faceted chart
ggplot(summary_delivery_cost_comparison, 
       aes(x = LOCAL_FOUNT_ONLY, y = total_delivery_cost, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +  # Light yellow background for Over 400 gallons threshold
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::dollar(total_delivery_cost, accuracy = 1)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "free_y") +  # Allow y-axis to vary between facets if necessary
  labs(title = "Comparison of Delivery Cost by Fleet Type Designation",
       y = "Total Delivery Cost ($ Millions)") +
  # Set colors for the fleet types
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  # Set x-axis labels 
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  # Scale y-axis to display values in millions
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 0),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold")
  )

Code
# Summary of delivery costs by fleet designation
delivery_cost_summary <- full_data_customer %>%
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation_type",
    values_to = "fleet_label"
  ) %>%
  mutate(delivery_cost = case_when(
    fleet_designation_type == "NEW_FLEET" & fleet_label == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation_type == "NEW_FLEET" & fleet_label == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation_type == "FLEET_TYPE" & fleet_label == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation_type == "FLEET_TYPE" & fleet_label == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  group_by(fleet_designation_type, fleet_label, LOCAL_FOUNT_ONLY) %>%
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation_type = factor(fleet_designation_type,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 Gallons Threshold", "Recommended Fleet Type"))
  )

# Comparison table with savings
fleet_savings_summary <- delivery_cost_summary %>%
  pivot_wider(
    id_cols = c(LOCAL_FOUNT_ONLY, fleet_label),
    names_from = fleet_designation_type,
    values_from = total_delivery_cost
  ) %>%
  mutate(
    savings = `Over 400 Gallons Threshold` - `Recommended Fleet Type`,
    savings_percentage = (savings / `Over 400 Gallons Threshold`) * 100
  )

# Cost savings bar chart
savings_plot <- ggplot(fleet_savings_summary, 
       aes(x = LOCAL_FOUNT_ONLY, y = savings, fill = fleet_label)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = paste0(
    scales::dollar(savings, accuracy = 1), 
    "\n(", scales::number(savings_percentage, accuracy = 0.1), "%)")), 
    position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Cost Savings for Recommended Fleet Type x 400 gallons threshold",
       y = "Cost Savings ($)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  coord_cartesian(ylim = c(-500000, 1500000)) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 0),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    panel.grid.major.x = element_blank()
  )
savings_plot

Relative to the 400-gallon threshold, the estimated differences over a two-year period would be:

  • Others – Red Truck: cost reduction of $803,612 (2%);
  • Others – White Truck: cost increase of $160,722 (3%);
  • Local Fountain Only – Red Truck: cost reduction of $161,079 (34%);
  • Local Fountain Only – White Truck: cost increase of $32,216 (22%).

The total cost using the 400-gallon threshold over two years would be $46,462,823, while the recommendation totals $45,691,071. The net result over these two years would be a total savings of $771,752, representing a 1.7% reduction compared to the original 400-gallon threshold strategy.

These values were calculated based on actual historical delivery volumes. Predicting whether these savings will continue in the future is highly uncertain due to many potential influencing factors—such as economic shifts, customer reactions, competitor strategies, and more. Additionally, the limited historical data (only two years) adds uncertainty to future projections.
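The net savings figure can be reproduced directly from the two reported totals:

```r
# Reported two-year totals from the cost comparison above
cost_threshold   <- 46462823  # 400-gallon threshold strategy
cost_recommended <- 45691071  # recommended fleet assignment

# Net savings and percentage reduction versus the threshold strategy
net_savings <- cost_threshold - cost_recommended
pct_savings <- net_savings / cost_threshold * 100

net_savings
round(pct_savings, 1)
```

Note that summing the four itemized deltas above gives $771,753 rather than $771,752; the one-dollar discrepancy is rounding in the itemized figures.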

10.2 Impact on Fleet Assignment by Cold Drink Channel

Code
# Summarize volume per Cold Drink Channel and NEW_FLEET
data_summary_fleet_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, NEW_FLEET) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) +
                   sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) +
                   sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Filter out the "CONVENTIONAL" channel
  filter(COLD_DRINK_CHANNEL != "CONVENTIONAL") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100)) %>%
  ungroup()

# Order channels by total volume
channel_order <- data_summary_fleet_channel %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Channel_Total = sum(Total_Volume)) %>%
  arrange(Channel_Total) %>%
  pull(COLD_DRINK_CHANNEL)

data_summary_fleet_channel$COLD_DRINK_CHANNEL <- factor(
  data_summary_fleet_channel$COLD_DRINK_CHANNEL,
  levels = channel_order
)

# Plot
ggplot(data_summary_fleet_channel, aes(x = Total_Volume, y = COLD_DRINK_CHANNEL, fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(
    aes(label = paste0(Percentage, "%")),
    position = position_stack(vjust = 0.5),
    hjust = 0.2,
    size = 3.2,
    color = "black"
  ) +
  geom_vline(
    xintercept = c(2500000, 5000000, 7500000, 10000000),
    color = "lightgray",
    linewidth = 0.5
  ) +
  scale_x_continuous(
    labels = comma_format(scale = 1e-6, suffix = "M"),
    breaks = c(2500000, 5000000, 7500000, 10000000),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  labs(
    title = "Our Recommendation - Total Volume by Cold Drink Channel",
    x = "Total Volume (in Millions)",
    y = NULL,
    fill = "New Fleet Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    legend.position = "bottom",
    legend.box = "horizontal",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

With the recommendation, the Dining segment would see a 7% reduction in the volume previously delivered by red trucks, which would now be delivered by white trucks.

Events and Public Sector would each experience a reduction of nearly 5% in red truck volume. The remaining segments would see changes of less than 2%.

The Conventional segment is not displayed due to its low volume, but the change in that segment was also less than 2%.

10.3 Impact on Order Types

Code
# Merge new fleet on full_data
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NEW_FLEET), by = "CUSTOMER_NUMBER")


# Summarize by ORDER_TYPE and NEW_FLEET using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(NEW_FLEET), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, NEW_FLEET) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE), .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE, levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order, aes(x = TotalDelivered, y = ORDER_TYPE, fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +  
  geom_text(aes(label = paste0(Percentage, "%")), 
            position = position_stack(vjust = 0.5), 
            hjust = 0, 
            color = "black", size = 3.2) +
  labs(title = "Our Recommendation - Delivered Volume by Order Type", 
       x = "Volume (units)", 
       y = NULL, 
       fill = "New Fleet Type") +  
  scale_x_continuous(
    labels = function(x) paste0(x / 1e6, "M"),
    breaks = c(2500000, 5000000, 7500000, 10000000),
    expand = expansion(c(0, 0.05))
  ) +  
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  
  theme_minimal() +  
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10, face = "plain"),
    legend.position = "right",
    legend.direction = "vertical",
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
    panel.grid.minor = element_blank()
  )

The key takeaway here is that the volume served by sales reps would see only a slight reduction of about 2 percent compared to the 400-gallon threshold. This helps avoid abrupt changes that could harm relationships with customers who have closer contact with our sales team.

The most significant shift, however, would occur with orders placed through the call center. Approximately 20 percent of the volume that would have been served by red trucks under the 400-gallon threshold would now be served by white trucks. This allows red trucks to be redirected to other order types with greater potential to strengthen customer relationships.

10.4 Customers Impacted

All Customers

Code
# Create WHITE_TO_RED:
# Assign 1 if the customer moves from a white truck to a red truck, otherwise 0
full_data_customer$WHITE_TO_RED <- ifelse(
  full_data_customer$FLEET_TYPE == "WHITE TRUCK" & full_data_customer$NEW_FLEET == "RED TRUCK",
  1, 0)

# Create RED_TO_WHITE:
# Assign 1 if the customer moves from a red truck to a white truck, otherwise 0
full_data_customer$RED_TO_WHITE <- ifelse(
  full_data_customer$FLEET_TYPE == "RED TRUCK" & full_data_customer$NEW_FLEET == "WHITE TRUCK",
  1, 0)

full_data_customer$CHANGED_FLEET <- ifelse(
  full_data_customer$FLEET_TYPE != full_data_customer$NEW_FLEET,
  "Yes", "No")

# Create fleet transition categories
fleet_change_summary <- full_data_customer %>%
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS, levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c(
  "Stayed Red" = "#B33951",
  "Stayed White" = "#D3D3D3",
  "Changed Fleet" = "plum"
)

# Plot with value and percentage labels
ggplot(fleet_change_summary, aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(
    title = "Number of Customers by Fleet Type (400 gal X New Recommendation)",
    x = "",
    y = "Number of Customers"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 10),
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )

Among all customers, compared to the 400-gallon threshold, 14 percent (4,235) would have their fleet assignment changed, either from red truck to white truck or vice versa.

These 4,235 customers represent 9.3% of the total volume sold in 2023 and 2024. Of these, 2,461 would switch from white trucks to red trucks (20% of the white truck volume), while 1,774 would switch from red trucks to white trucks (7.4% of the red truck volume).

Local Market Partners - Local Fountain Only

Code
# Create fleet transition categories for LOCAL_FOUNT_ONLY customers
fleet_change_summary <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%  # Filter only LOCAL_FOUNT_ONLY customers
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS, levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c(
  "Stayed Red" = "#B33951",
  "Stayed White" = "#D3D3D3",
  "Changed Fleet" = "plum"
)

# Plot with value and percentage labels
ggplot(fleet_change_summary, aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +  # Adjust width here
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(
    title = "LFO Number of Customers (400 gal X New Recommendation)",
    x = "",
    y = "Number of Customers"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 10),
    plot.title = element_text(face = "bold"),
    panel.grid.major = element_blank(),  # Remove major grid lines
    panel.grid.minor = element_blank()   # Remove minor grid lines
  )

Among local market partners, 148 customers (11%) would switch fleets, accounting for 25% of the group’s total volume. Of these, 95 would switch from red trucks to white trucks, representing 52% of red truck customers and 37% of the red truck volume in this group.

Additionally, 53 customers would switch from white trucks to red trucks, representing 4.5% of white truck customers and 9% of the white truck volume in this group.

Impacts on Chain Members

Code
# Define levels to control the bar order
fleet_status_levels <- c("Stayed Red", "Red to White", "Stayed White", "White to Red")

# Prepare the data with CHAIN_MEMBER included
fleet_change_data <- full_data_customer %>%
  mutate(
    FLEET_ORIGIN = case_when(
      FLEET_TYPE == "RED TRUCK" ~ "RED_TO_WHITE",
      FLEET_TYPE == "WHITE TRUCK" ~ "WHITE_TO_RED",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = case_when(
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Red to White",
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "White to Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = factor(FLEET_STATUS, levels = fleet_status_levels),
    CHAIN_MEMBER = as.factor(CHAIN_MEMBER)
  )

# Summarize with CHAIN_MEMBER
fleet_change_summary <- fleet_change_data %>%
  filter(FLEET_ORIGIN %in% c("RED_TO_WHITE", "WHITE_TO_RED")) %>%
  group_by(FLEET_ORIGIN, CHAIN_MEMBER, FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  group_by(FLEET_ORIGIN, CHAIN_MEMBER) %>%
  mutate(
    Percentage = Num_Customers / sum(Num_Customers) * 100,
    Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)")
  )

# Define custom colors
fleet_colors <- c(
  "Red to White" = "#D3D3D3",
  "White to Red" = "#B33951",
  "Stayed Red" = "#B33951",
  "Stayed White" = "#D3D3D3"
)

# Plot function
plot_fleet_change <- function(origin) {
  ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == origin), 
         aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
    geom_col(show.legend = FALSE, width = 0.45) +
    geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
    scale_fill_manual(values = fleet_colors) +
    facet_wrap(~ CHAIN_MEMBER, labeller = label_both) +
    scale_y_continuous(limits = c(0, 15000)) +  # Y scale set to 0–15000
    labs(
      title = paste("Fleet Transition:", gsub("_", " ", origin)),
      x = "",
      y = "Number of Customers"
    ) +
    theme_minimal() +
    theme(
      axis.text.x = element_text(size = 10),
      plot.title = element_text(face = "bold"),
      panel.grid.major.x = element_blank(),  # Remove vertical grid lines
      panel.grid.minor.x = element_blank()
    )
}

# Plots
plot_fleet_change("RED_TO_WHITE")

Code
plot_fleet_change("WHITE_TO_RED")

Among customers who are chain members (CHAIN_MEMBER = 1) and who, based on the 400-gallon threshold, should be served by red trucks, 17% would now be served by white trucks instead. This shift raises the question of whether there could be a negative impact due to the inconsistent service model within the same customer group.

In parallel, 16% of customers who should be served by white trucks under the same threshold would now be served by red trucks. This inversion in fleet assignment suggests a possible misalignment with the intended operational segmentation, and should be further evaluated to ensure customer experience and operational efficiency are not compromised.

10.5 Impact on Customer Segments (clusters)

Below is the visualization of customers by cluster who would change their fleet assignment based on their consumption and number of orders.

Code
# Define custom colors for the fleet and clusters
palette_fleet <- c(
  "RED TRUCK" = "#B33951",  # Red truck
  "WHITE TRUCK" = "#D3D3D3"  # White truck
)

palette_clusters <- c(
  "Cluster 1" = "#FF6347",  # Coral
  "Cluster 2" = "#4682B4",  # Cornflower blue
  "Cluster 3" = "#FFD700"   # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter only customers with CHANGED_FLEET == "Yes"
changed_fleet_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  filter(NEW_FLEET %in% c("RED TRUCK", "WHITE TRUCK")) %>%
  mutate(Fleet_Type = NEW_FLEET)

# Define a custom labeller for the clusters
custom_labeller <- labeller(
  CLUSTER = c(
    "1" = "Cluster 1",
    "2" = "Cluster 2",
    "3" = "Cluster 3"
  )
)

# Create scatter plot with log scales and no background color for facet labels
ggplot(changed_fleet_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits
  scale_y_log10(
    limits = c(10, 1000000),
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),
    labels = scales::comma
  ) + 
  scale_color_manual(values = palette_fleet) +
  scale_linetype_manual(values = "solid", name = "") + 
  labs(
    title = "Customers who changed truck assignments by Cluster" ,
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER + Fleet_Type, scales = "fixed", nrow = 1, labeller = custom_labeller) +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),  # Adjust text size for facet labels
    strip.background = element_blank(),  # Remove background color from facet labels
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom",
    legend.box = "vertical"
  ) +
  guides(color = "none")  # Remove legend for Fleet_Type

Out of the 4,235 customers who would change their fleet assignment (representing 9.3% of the total volume):

  • Cluster Breakdown:

1,273 customers from Cluster 1 would now be served by red trucks.

1,188 customers from Cluster 2 would switch from white trucks to red trucks, reflecting high potential, recency, and order frequency.

  • Additionally:

1,748 customers from Cluster 2 would switch from red trucks to white trucks.

26 customers from Cluster 3 would switch from red trucks to white trucks.
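Cluster-level transition counts like those above can be derived with a grouped count. The sketch below is illustrative only: it uses a toy data frame standing in for `full_data_customer`, and the `OLD_FLEET`/`NEW_FLEET` column names are assumptions for the example.

```r
library(dplyr)

# Toy stand-in for full_data_customer; OLD_FLEET / NEW_FLEET are assumed columns
customers <- tibble(
  CUSTOMER_NUMBER = 1:6,
  CLUSTER   = c(1, 2, 2, 2, 3, 1),
  OLD_FLEET = c("WHITE TRUCK", "WHITE TRUCK", "RED TRUCK",
                "RED TRUCK", "RED TRUCK", "RED TRUCK"),
  NEW_FLEET = c("RED TRUCK", "RED TRUCK", "WHITE TRUCK",
                "RED TRUCK", "WHITE TRUCK", "RED TRUCK")
)

# Keep only customers whose assignment changed, then count by cluster and direction
transition_summary <- customers %>%
  filter(OLD_FLEET != NEW_FLEET) %>%
  count(CLUSTER, OLD_FLEET, NEW_FLEET, name = "n_customers")

transition_summary
```

The same `count()` call on the real data, grouped by the actual cluster and fleet columns, yields the figures reported above.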

Customer Segmentation and Cold Drink Channel

Among the customers who would change fleet assignments, the majority belong to the Dining segment (52%), where 962 would switch from red trucks to white trucks, and 1,229 would switch from white trucks to red trucks.

The second-largest segment with changes is GOODS (19%), where 251 customers would switch from red trucks to white trucks, and 549 would switch from white trucks to red trucks.

The EVENT segment (9%) would have 226 customers switching from red trucks to white trucks, while 159 would switch from white trucks to red trucks.
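The segment-level breakdown above can be produced with a similar grouped count, this time by cold drink channel and destination fleet. This is again a hedged sketch on toy data; the column names mirror those used in the plots but the values are invented.

```r
library(dplyr)

# Toy stand-in for the customers flagged CHANGED_FLEET == "Yes"
changed <- tibble(
  COLD_DRINK_CHANNEL = c("DINING", "DINING", "GOODS", "EVENT", "EVENT"),
  NEW_FLEET          = c("RED TRUCK", "WHITE TRUCK", "RED TRUCK",
                         "WHITE TRUCK", "RED TRUCK")
)

# Count changes by channel and destination fleet, plus each channel's
# share of all fleet changes
channel_summary <- changed %>%
  count(COLD_DRINK_CHANNEL, NEW_FLEET, name = "n") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(channel_share = sum(n) / nrow(changed)) %>%
  ungroup()

channel_summary
```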

Code
# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Define colors for NEW_FLEET
fleet_colors <- c(
  "RED TRUCK" = "#B33951",
  "WHITE TRUCK" = "#D3D3D3"
)

# Filter data for changed fleet
filtered_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes")

# Calculate the number of unique customers
unique_customers <- filtered_data %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))
#print(unique_customers)

# Create scatter plot colored by NEW_FLEET
ggplot(filtered_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, 
            aes(x = x, y = y, linetype = type), 
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) + 
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Customers Who Changed Fleet Assignment",
    subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", linewidth = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", linewidth = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom"
  )

Code
# Define colors based on NEW_FLEET
fleet_colors <- c(
  "RED TRUCK" = "#B33951",    # Red
  "WHITE TRUCK" = "#D3D3D3"   # Light gray
)

# Filter the data for LOCAL_FOUNT_ONLY == 1 and CHANGED_FLEET == "Yes"
filtered_local_fount <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1, CHANGED_FLEET == "Yes")

# Threshold line for 400 gallons
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Updated plot
ggplot(filtered_local_fount) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.6, width = 0.2) +
  geom_line(data = threshold_line,
            aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) +
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Local Fountain Only Customers Who Changed Fleet Assignment",
    subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)",
    color = "New Fleet"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", linewidth = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", linewidth = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom"
  )

Code
# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and the new variables from removal
vars_to_keep <- c("full_data", "full_data_customer","mydir", "one_seed", "reference_date")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)

Among the local market partners buying fountain only, nearly 90% of the fleet changes would occur in the Dining segment. Within this group, 85 customers would switch from white trucks to red trucks, and 45 would switch from red trucks to white trucks.

11. Business Value and Final Conclusions

The proposed fleet reassignment strategy would have saved the company approximately $770,000 over the past two years by increasing the number of customers served by red trucks, optimizing their usage frequency, and reducing their delivered volume by 3%, freeing capacity for eventual redeployment to strategic customers.

The proposal was quite conservative, redesigning the delivery method for only 14% of customers, and it assigned the fleet based not only on volume but also on several intrinsic customer characteristics. The expectation, therefore, is that implementation will yield gains not only in cost reduction but also in sales growth, mainly among customers with greater growth potential. In addition, the analysis identified three main customer groups, two of which showed good homogeneity.

When measuring the impacts of the new fleet assignment, the Dining segment was the most affected by these changes, particularly among the local market partners classified as fountain only. There was no significant impact on the activities of sales representatives, but red-truck delivered volume fell significantly (-20%) for orders placed through call centers. This is actually a good outcome, since call-center orders no longer reflected a strong customer relationship.

A differentiator for the process was the feature engineering, which brought robustness to the clustering. The supervised models, Decision Tree and Multinomial Logistic Regression, were very important in explaining the variables that influenced the clusters, and with accuracy close to 90%, they have the potential to predict segments for new customers.
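As an illustration of how a multinomial model of this kind could score new customers into segments, the sketch below fits `nnet::multinom` on simulated data. The two features and the cluster rule are invented for the example; they are not the project's engineered variables or actual model.

```r
library(nnet)

set.seed(123)

# Simulated features and a made-up cluster rule (purely illustrative)
n <- 300
consump <- exp(rnorm(n, mean = 5, sd = 1))   # annual consumption, gallons
orders  <- rpois(n, lambda = 20)             # number of orders
cluster <- factor(ifelse(consump <= 400, 3,
                  ifelse(orders > 20, 1, 2)))
train <- data.frame(LOG_CONSUMP = log(consump),
                    NUM_ORDERS  = orders,
                    CLUSTER     = cluster)

# Multinomial logistic regression predicting cluster membership
fit <- multinom(CLUSTER ~ LOG_CONSUMP + NUM_ORDERS, data = train, trace = FALSE)

# In-sample accuracy; new customers could be scored the same way via predict()
accuracy <- mean(predict(fit, train) == train$CLUSTER)
```

With the project's real engineered features, the same `predict()` call would assign segments to newly onboarded customers before any delivery history accumulates.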

Limitations, Improvements, and Lessons

One of the main limitations of this project was the short, two-year historical window, which made it difficult to predict the future impact of the recommendation. Forecasting was challenging because the predictive intervals were wide, meaning almost any outcome was plausible.

Another challenge was the asynchrony between customer orders, which made it hard to track individual customer growth tied to specific times of the year. With a longer historical series, we could have made more accurate future predictions.

The census data could have been better utilized. The way it was applied in this project didn’t deliver the expected results, but with adjustments and more historical data, it could provide valuable insights for future analysis.

It’s clear that predicting future growth, even with extensive data, is a complex task. These predictions should only be emphasized if the process is robust, with strong statistical support and a consistent range of possible outcomes. Otherwise, it might be better to refrain from highlighting them.

Looking ahead, I strongly recommend conducting further tests to measure the impact of fleet allocation and the way customers place orders. This will be crucial in validating or refining the current approach. Additionally, analyzing revenue could provide deeper business insights, especially in understanding margins across different customer segments.

A key takeaway from this project is that data doesn’t always provide all the answers we need for decision-making. In these cases, history shows that there will be both successes and setbacks, but decisions still need to be made. My role was to make responsible recommendations and take a clear stance, even when faced with uncertainties.